Accessing grouping variables in purrr::map() with nested dataframes

I'm using tidyr::nest() in combination with the purrr::map() family to split a data.frame into groups and then do some fancy stuff with each subset. Consider the following example, and please ignore the fact that I don't need nest() and map() to do this (it is an oversimplified example):
library(dplyr)
library(purrr)
library(tidyr)
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(
wt_mean = map_dbl(data,~mean(.x$wt))
)
# A tibble: 8 x 4
cyl gear data cly2
<dbl> <dbl> <list> <dbl>
1 6 4 <tibble [4 x 9]> 6
2 4 4 <tibble [8 x 9]> 4
3 6 3 <tibble [2 x 9]> 6
4 8 3 <tibble [12 x 9]> 8
5 4 3 <tibble [1 x 9]> 4
6 4 5 <tibble [2 x 9]> 4
7 8 5 <tibble [2 x 9]> 8
8 6 5 <tibble [1 x 9]> 6
Usually when I do this type of operation, I need access to the grouping variable (cyl in this case) within map(). But these grouping variables appear as vectors with length corresponding to the number of rows in the nested dataframe, and therefore don't lend themselves easily.
Is there a way I could run the following operation? I would want the mean of wt to be divided by the number of cylinders (cyl) per group (i.e. row).
mtcars %>%
group_by(cyl,gear) %>%
nest() %>%
mutate(
wt_mean = map_dbl(data,~mean(.x$wt)/cyl)
)
Error in mutate_impl(.data, dots) :
Evaluation error: Result 1 is not a length 1 atomic vector.

Take cyl out of the map call:
mtcars %>%
group_by(cyl,gear) %>%
nest() %>%
mutate(
wt_mean = map_dbl(data, ~mean(.x$wt)) / cyl
)
# A tibble: 8 x 4
cyl gear data wt_mean
<dbl> <dbl> <list> <dbl>
1 6 4 <tibble [4 x 9]> 0.516
2 4 4 <tibble [8 x 9]> 0.595
3 6 3 <tibble [2 x 9]> 0.556
4 8 3 <tibble [12 x 9]> 0.513
5 4 3 <tibble [1 x 9]> 0.616
6 4 5 <tibble [2 x 9]> 0.457
7 8 5 <tibble [2 x 9]> 0.421
8 6 5 <tibble [1 x 9]> 0.462
map_dbl() sees cyl as a length-8 vector because nest() removes the grouping from the data.frame. Using cyl inside the map_* call (as in the OP's example) therefore yields 8 length-8 vectors instead of 8 single values, which is what triggers the "not a length 1 atomic vector" error.
Two other approaches:
Both give the same result as above but keep the grouping variable inside the map_* call, per the OP's specs:
Regrouping after nest()
mtcars %>%
group_by(cyl,gear) %>%
nest() %>%
group_by(cyl, gear) %>%
mutate(wt_mean = map_dbl(data,~mean(.x$wt)/cyl))
Using map2 to iterate over data and cyl in parallel
mtcars %>%
group_by(cyl,gear) %>%
nest() %>%
mutate(wt_mean = map2_dbl(data, cyl,~mean(.x$wt)/ .y))
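As a quick sanity check, the following sketch compares dividing outside the map call with the map2_dbl() version. The explicit ungroup() is my own addition so that the behaviour does not depend on whether the installed tidyr keeps or drops the grouping after nest(); the column names outside and paired are purely illustrative.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Nest by cyl and gear, then drop any remaining grouping explicitly
nested <- mtcars %>%
  group_by(cyl, gear) %>%
  nest() %>%
  ungroup()

res <- nested %>%
  mutate(
    # divide the per-group mean by the whole cyl column, outside the map call
    outside = map_dbl(data, ~ mean(.x$wt)) / cyl,
    # pair each nested tibble (.x) with its own cyl value (.y)
    paired = map2_dbl(data, cyl, ~ mean(.x$wt) / .y)
  )

# Both columns should agree row by row
all.equal(res$outside, res$paired)
```

map2_dbl() hands the function one nested tibble and one cyl value at a time, so each call returns a single number.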

In the new release of dplyr, 0.8.0, you can now use group_map(), which I find very handy for this use case. This is the example by GitHub user @yutannihilation:
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
group_by(cyl) %>%
group_map(function(data, group_info) {
tibble::tibble(wt_mean = mean(data$wt) / group_info$cyl)
})
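A note for newer installations: as far as I know, from dplyr 0.8.1 onward group_map() returns a plain list, and the variant that returns a grouped data frame is called group_modify(). A minimal sketch of the same calculation under that assumption (by_cyl is just an illustrative name):

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data for one group, .y is a one-row tibble holding the group keys
  group_modify(~ tibble::tibble(wt_mean = mean(.x$wt) / .y$cyl))

by_cyl
```

The result should have one row per cyl value, with the grouping column added back in front of wt_mean.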

Related

Specify a row on mutate

I have this code which adds a different text for each line in the df.
mt<- mtcars %>% group_by(cyl,am) %>% nest() %>%
mutate(formula = "Add separate model for each row in text like mpg~wt for one row, mpg~wt+hp for another etc.")
mt$formula[[1]] <- "mpg~wt"
mt$formula[[2]] <- "mpg~wt+drat"
mt$formula[[3]] <- "mpg~wt+qsec"
mt$formula[[4]] <- "mpg~wt+gear"
mt$formula[[5]] <- "mpg~wt"
mt$formula[[6]] <- "mpg~wt+qsec"
Is there a more elegant way of referring to row number 1, like specifying cyl==6 and am==1 instead of the line number?
Thanks
One option would be to use dplyr::case_when:
library(tidyr)
library(dplyr)
mt<- mtcars %>% group_by(cyl,am) %>% nest()
mt %>%
mutate(formula = case_when(
cyl == 6 & am == 1 ~ "mpg~wt",
cyl == 8 & am == 0 ~ "mpg~wt+drat"
))
#> # A tibble: 6 x 4
#> # Groups: cyl, am [6]
#> cyl am data formula
#> <dbl> <dbl> <list> <chr>
#> 1 6 1 <tibble [3 x 9]> mpg~wt
#> 2 4 1 <tibble [8 x 9]> <NA>
#> 3 6 0 <tibble [4 x 9]> <NA>
#> 4 8 0 <tibble [12 x 9]> mpg~wt+drat
#> 5 4 0 <tibble [3 x 9]> <NA>
#> 6 8 1 <tibble [2 x 9]> <NA>
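If the list of formulas grows, another option (not from the answer above, just a sketch) is to keep them in a small lookup table and join it onto the nested data by cyl and am. The table name formulas is hypothetical; groups without an entry end up with NA, as with case_when().

```r
library(dplyr)
library(tidyr)

# Hypothetical lookup table: one row per (cyl, am) combination that needs a formula
formulas <- tibble::tribble(
  ~cyl, ~am, ~formula,
  6,    1,   "mpg~wt",
  8,    0,   "mpg~wt+drat"
)

mt <- mtcars %>%
  group_by(cyl, am) %>%
  nest() %>%
  # match each nested row to its formula by the grouping keys
  left_join(formulas, by = c("cyl", "am"))

mt
```

This keeps the mapping from group to formula in data rather than in code, which can be easier to maintain than a long chain of conditions.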

Nesting tibbles after grouping by factor variables produces NULL elements in R

I was reviewing some of my old R code when I stumbled upon several errors.
After running each line and playing around with my data, I discovered that calling tidyr::nest() on tibbles grouped with dplyr::group_by() on factor variables produced one or more NULL elements.
Here is an example with the mtcars data:
library(dplyr)
library(tidyr)
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)) %>%
group_by(cyl, carb) %>%
nest()
# A tibble: 9 x 3
# cyl carb data
# <fct> <fct> <list>
# 1 6 4 <NULL>
# 2 4 1 <tibble [5 x 1]>
# 3 6 1 <tibble [3 x 1]>
# 4 8 2 <NULL>
# 5 8 4 <NULL>
# 6 4 2 <tibble [6 x 1]>
# 7 8 3 <NULL>
# 8 6 6 <NULL>
# 9 8 8 <NULL>
I thought nest() is taking factors as.numeric() and "getting confused" when different variables present same-named groups. But then I tried:
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl) %>% as.numeric(),
carb = factor(carb) %>% as.numeric()) %>%
group_by(cyl, carb) %>%
nest()
and got the same result as when nesting with non-factor variables:
# A tibble: 9 x 3
# cyl carb data
# <dbl> <dbl> <list>
# 1 2 4 <tibble [4 x 1]>
# 2 1 1 <tibble [5 x 1]>
# 3 2 1 <tibble [2 x 1]>
# 4 3 2 <tibble [4 x 1]>
# 5 3 4 <tibble [6 x 1]>
# 6 1 2 <tibble [6 x 1]>
# 7 3 3 <tibble [3 x 1]>
# 8 2 5 <tibble [1 x 1]>
# 9 3 6 <tibble [1 x 1]>
Compare with:
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
group_by(cyl, carb) %>%
nest()
# A tibble: 9 x 3
# cyl carb data
# <dbl> <dbl> <list>
# 1 6 4 <tibble [4 x 1]>
# 2 4 1 <tibble [5 x 1]>
# 3 6 1 <tibble [2 x 1]>
# 4 8 2 <tibble [4 x 1]>
# 5 8 4 <tibble [6 x 1]>
# 6 4 2 <tibble [6 x 1]>
# 7 8 3 <tibble [3 x 1]>
# 8 6 6 <tibble [1 x 1]>
# 9 8 8 <tibble [1 x 1]>
Since my code used to work fine until last month, I wonder if tidyr was updated lately and the way of handling factor groups by nest() was changed?
Is it generally advisable not to nest data grouped by factor variables or rather not to group_by() on factor variables?
Edit:
In the issue mentioned by aosmith, Hadley refers to group_nest(), which seems to resolve the problem (caveat: this function reorders the tibble!). Nevertheless, I still wonder why nest() is producing NULLs...
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)) %>%
group_by(cyl, carb) %>%
group_nest() %>%
unnest %>%
all.equal(.,
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)))
# [1] TRUE
As aosmith suggested, this has recently been fixed in the dev version of tidyr. Since I didn't recognize that from the linked issue and couldn't manage to install the dev version, I submitted this question as another issue. Hadley just answered it.

Tidyverse and R: how to count rows in a tibble of a nested dataframe

So, I've checked multiple posts and haven't found anything. According to this, my code should work, but it isn't.
Objective: I want to essentially print out the number of subjects--which in this case is also the number of rows in this tibble.
Code:
data<-read.csv("advanced_r_programming/data/MIE.csv")
make_LD<-function(x){
LongitudinalData<-x%>%
group_by(id)%>%
nest()
structure(list(LongitudinalData), class = "LongitudinalData")
}
print.LongitudinalData<-function(x){
paste("Longitudinal dataset with", x[["id"]], "subjects")
}
x<-make_LD(data)
print(x)
Here's the head of the dataset I'm working on:
> head(x)
[[1]]
# A tibble: 10 x 2
id data
<int> <list>
1 14 <tibble [11,945 x 4]>
2 20 <tibble [11,497 x 4]>
3 41 <tibble [11,636 x 4]>
4 44 <tibble [13,104 x 4]>
5 46 <tibble [13,812 x 4]>
6 54 <tibble [10,944 x 4]>
7 64 <tibble [11,367 x 4]>
8 74 <tibble [11,517 x 4]>
9 104 <tibble [11,232 x 4]>
10 106 <tibble [13,823 x 4]>
Output:
[1] "Longitudinal dataset with subjects"
I've tried every possible combination from the aforementioned stackoverflow post and none seem to work.
Here are two options:
library(tidyverse)
# Create a nested data frame
dat = mtcars %>%
group_by(cyl) %>%
nest() %>% as_tibble()  # as.tibble() is deprecated in favour of as_tibble()
cyl data
1 6 <tibble [7 x 10]>
2 4 <tibble [11 x 10]>
3 8 <tibble [14 x 10]>
dat %>%
mutate(nrow=map_dbl(data, nrow))
dat %>%
group_by(cyl) %>%
mutate(nrow = nrow(data.frame(data)))
cyl data nrow
1 6 <tibble [7 x 10]> 7
2 4 <tibble [11 x 10]> 11
3 8 <tibble [14 x 10]> 14
There is a specific function for this in the tidyverse: n()
You can simply do: mtcars %>% group_by(cyl) %>% summarise(rows = n())
> mtcars %>% group_by(cyl) %>% summarise(rows = n())
# A tibble: 3 x 2
cyl rows
<dbl> <int>
1 4 11
2 6 7
3 8 14
In more sophisticated cases, in which subjects may span across multiple rows ("long format data"), you can do (assuming hp denotes the subject):
> mtcars %>% group_by(cyl, hp) %>% #always group by subject-ID last
+ summarise(n = n()) %>% #observations per subject and cyl
+ summarise(n = n()) #subjects per cyl (implicitly summarises across all group-variables except the last)
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 2
cyl n
<dbl> <int>
1 4 10
2 6 4
3 8 9
Note that the n in the last case is smaller than in the first because cars with the same cyl and hp are now counted as just one "subject".
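Coming back to the print method in the question: make_LD() wraps the nested tibble in a list, so x[["id"]] is NULL and nothing is pasted in. One possible fix, sketched here with a made-up toy data frame instead of MIE.csv, is to count the rows of the first list element, since the nested tibble has one row per id.

```r
library(dplyr)
library(tidyr)

make_LD <- function(x) {
  nested <- x %>%
    group_by(id) %>%
    nest()
  structure(list(nested), class = "LongitudinalData")
}

print.LongitudinalData <- function(x, ...) {
  # the nested tibble sits in the first list element, one row per subject
  n_subjects <- nrow(x[[1]])
  cat("Longitudinal dataset with", n_subjects, "subjects\n")
  invisible(x)
}

# Toy stand-in for the real data (hypothetical values)
toy <- data.frame(id = c(1, 1, 2, 3, 3, 3), value = 1:6)
ld <- make_LD(toy)
print(ld)
```

Using cat() instead of paste() means the method actually prints the message rather than relying on the returned string being auto-printed.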

adding summarize output to original tibble

I would like to do something in between mutate and summarize.
I would like to calculate a summary statistic on groups, but retain the original data as a nested object. I assume this is a pretty generic task, but I can't figure out how to do it without invoking a join as well as grouping twice. Example code below:
mtcars %>%
group_by(cyl) %>%
nest() %>%
left_join(mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg)))
which produced desired output:
# A tibble: 3 x 3
cyl data mean_mpg
<dbl> <list> <dbl>
1 6 <tibble [7 x 10]> 19.74286
2 4 <tibble [11 x 10]> 26.66364
3 8 <tibble [14 x 10]> 15.10000
but I feel like this is not the "correct" way to do this.
Here is one way to do this without a join: use map_dbl() from the purrr package (a member of the tidyverse family), which is essentially map() with the outcome returned as a vector of type double, to calculate the mean of mpg nested in the data column:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(mean_mpg = map_dbl(data, ~ mean(.x$mpg)))
# A tibble: 3 x 3
# cyl data mean_mpg
# <dbl> <list> <dbl>
#1 6 <tibble [7 x 10]> 19.74286
#2 4 <tibble [11 x 10]> 26.66364
#3 8 <tibble [14 x 10]> 15.10000
Or you can calculate mean_mpg before nesting, and add mean_mpg as one of the group variables:
mtcars %>%
group_by(cyl) %>%
mutate(mean_mpg = mean(mpg)) %>%
group_by(mean_mpg, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions use add = TRUE
nest()
# A tibble: 3 x 3
# cyl mean_mpg data
# <dbl> <dbl> <list>
#1 6 19.74286 <tibble [7 x 10]>
#2 4 26.66364 <tibble [11 x 10]>
#3 8 15.10000 <tibble [14 x 10]>

how to nest a column into itself in tidyverse

library(tidyverse)
why does this produce a list column 'am':
mtcars %>%
group_by(cyl) %>%
mutate(am=list(mtcars[,'am']))
but not:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(am=list(mtcars[,'am']))
Error: not compatible with STRSXP
I realize this is a bit of a contrived example, but it's relevant to what I'm working on. Does mutate not scope outside its environment?
mtcars %>% group_by(cyl) %>% nest()
## # A tibble: 3 × 2
## cyl data
## <dbl> <list>
## 1 6 <tibble [7 × 10]>
## 2 4 <tibble [11 × 10]>
## 3 8 <tibble [14 × 10]>
has three rows, so any column you need has to have three elements, as well.
If you want the full am column for each row, you can either mutate rowwise, which will evaluate the mutate call separately for each row,
mtcars %>% group_by(cyl) %>% nest() %>% rowwise() %>% mutate(am = list(mtcars$am))
## Source: local data frame [3 x 3]
## Groups: <by row>
##
## # A tibble: 3 × 3
## cyl data am
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <dbl [32]>
## 2 4 <tibble [11 × 10]> <dbl [32]>
## 3 8 <tibble [14 × 10]> <dbl [32]>
or without rowwise, just repeat the desired list for each row:
mtcars %>% group_by(cyl) %>% nest() %>% mutate(am = rep(list(mtcars$am), n()))
## # A tibble: 3 × 3
## cyl data am
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <dbl [32]>
## 2 4 <tibble [11 × 10]> <dbl [32]>
## 3 8 <tibble [14 × 10]> <dbl [32]>
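If the goal is instead to store each group's own am values (rather than the full 32-element column) next to the nested data, a possible sketch is to pull them out of each nested tibble with purrr::map(). This is an assumption about what "nesting a column into itself" is meant to achieve; the ungroup() is added only so the result does not depend on how the installed tidyr handles grouping after nest().

```r
library(dplyr)
library(purrr)
library(tidyr)

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  # one list element per row: the am values belonging to that cyl group
  mutate(am = map(data, ~ .x$am))

# each list element is as long as its group has rows
lengths(nested$am)
```

Because map() returns exactly one element per nested tibble, the new column automatically has the three elements the nested data frame requires.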

Resources