how to nest a column into itself in tidyverse - r

library(tidyverse)
Why does this produce a list column 'am':
mtcars %>%
group_by(cyl) %>%
mutate(am=list(mtcars[,'am']))
but this does not:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(am=list(mtcars[,'am']))
Error: not compatible with STRSXP
I realize this is a bit of a contrived example, but it's relevant to what I'm working on. Does mutate not scope outside its environment?

mtcars %>% group_by(cyl) %>% nest()
## # A tibble: 3 × 2
## cyl data
## <dbl> <list>
## 1 6 <tibble [7 × 10]>
## 2 4 <tibble [11 × 10]>
## 3 8 <tibble [14 × 10]>
The nested tibble has three rows, so any column you add must also have three elements.
If you want the full am column in each row, you can either mutate rowwise, which evaluates the mutate call separately for each row,
mtcars %>% group_by(cyl) %>% nest() %>% rowwise() %>% mutate(am = list(mtcars$am))
## Source: local data frame [3 x 3]
## Groups: <by row>
##
## # A tibble: 3 × 3
## cyl data am
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <dbl [32]>
## 2 4 <tibble [11 × 10]> <dbl [32]>
## 3 8 <tibble [14 × 10]> <dbl [32]>
or without rowwise, just repeat the desired list for each row:
mtcars %>% group_by(cyl) %>% nest() %>% mutate(am = rep(list(mtcars$am), n()))
## # A tibble: 3 × 3
## cyl data am
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <dbl [32]>
## 2 4 <tibble [11 × 10]> <dbl [32]>
## 3 8 <tibble [14 × 10]> <dbl [32]>
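The length argument above can be checked directly. This is a minimal sketch, assuming tidyr >= 1.0.0 (for the `nest(data = -cyl)` syntax); the names `nested` and `res` are only illustrative.

```r
library(dplyr)
library(tidyr)

# Assumes tidyr >= 1.0.0, where nest() takes `new_col = <selection>`.
nested <- mtcars %>% nest(data = -cyl)   # ungrouped tibble, one row per cyl

# A bare list(mtcars$am) has length 1; repeating it n() times gives one
# copy of the full am vector per row of the nested tibble.
res <- nested %>% mutate(am = rep(list(mtcars$am), n()))

nrow(res)        # number of rows in the nested tibble
lengths(res$am)  # length of the am vector stored in each row
```

Each element of `am` holds the complete column from the original data, not just the rows belonging to that cyl group.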

Related

Since .key is deprecated, how is it possible to rename the data column in nest()?

Before .key was deprecated I did this:
library(tidyverse)
mtcars %>% group_by(cyl) %>% nest(.key = "my_name")
The help page for nest() says that this is now done with tidy select, but I don't know how.
You can use the new nest_by() function in dplyr 1.0.0, which works similarly to what you had previously with nest().
library(dplyr)
mtcars %>% group_by(cyl) %>% nest_by(.key = "my_name")
# cyl my_name
# <dbl> <list<tbl_df[,10]>>
#1 4 [11 × 10]
#2 6 [7 × 10]
#3 8 [14 × 10]
You can also do the same without grouping.
mtcars %>% nest_by(cyl, .key = "my_name")
You can use group_cols() to refer to grouping variables:
mtcars %>% group_by(cyl) %>% nest(my_name = !group_cols())
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl my_name
#> <dbl> <list>
#> 1 6 <tibble [7 × 10]>
#> 2 4 <tibble [11 × 10]>
#> 3 8 <tibble [14 × 10]>
mtcars %>% nest(my_name = !cyl)
#> # A tibble: 3 x 2
#> cyl my_name
#> <dbl> <list>
#> 1 6 <tibble [7 × 10]>
#> 2 4 <tibble [11 × 10]>
#> 3 8 <tibble [14 × 10]>
The name can be provided directly in the arguments of nest:
mtcars %>% nest( my_name = -cyl ) # Nest every column except cyl
# # A tibble: 3 x 2
# cyl my_name
# <dbl> <list>
# 1 6 <tibble [7 × 10]>
# 2 4 <tibble [11 × 10]>
# 3 8 <tibble [14 × 10]>
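Another option, not mentioned in the answers above, is to keep the default column name and rename it afterwards. This is a small sketch that relies only on dplyr's rename(); the object name `renamed` is illustrative.

```r
library(dplyr)
library(tidyr)

# Nest with the default column name ("data"), then rename it.
renamed <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  rename(my_name = data)

names(renamed)
```

This costs one extra step, but it does not depend on how a particular tidyr version spells the naming argument.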

Nesting tibbles after grouping by factor variables produces NULL elements in R

I was reviewing some of my old R code when I stumbled upon several errors.
After running each line and playing around with my data, I discovered that calling tidyr::nest() on a tibble grouped by factor variables (via dplyr::group_by()) produced one or more NULL elements.
Here is an example with the mtcars data:
library(dplyr)
library(tidyr)
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)) %>%
group_by(cyl, carb) %>%
nest()
# A tibble: 9 x 3
# cyl carb data
# <fct> <fct> <list>
# 1 6 4 <NULL>
# 2 4 1 <tibble [5 x 1]>
# 3 6 1 <tibble [3 x 1]>
# 4 8 2 <NULL>
# 5 8 4 <NULL>
# 6 4 2 <tibble [6 x 1]>
# 7 8 3 <NULL>
# 8 6 6 <NULL>
# 9 8 8 <NULL>
I suspected that nest() was converting the factors with as.numeric() and "getting confused" when different variables contain groups with the same name. But then I tried:
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl) %>% as.numeric(),
carb = factor(carb) %>% as.numeric()) %>%
group_by(cyl, carb) %>%
nest()
and got the same result as when nesting with non-factor variables:
# A tibble: 9 x 3
# cyl carb data
# <dbl> <dbl> <list>
# 1 2 4 <tibble [4 x 1]>
# 2 1 1 <tibble [5 x 1]>
# 3 2 1 <tibble [2 x 1]>
# 4 3 2 <tibble [4 x 1]>
# 5 3 4 <tibble [6 x 1]>
# 6 1 2 <tibble [6 x 1]>
# 7 3 3 <tibble [3 x 1]>
# 8 2 5 <tibble [1 x 1]>
# 9 3 6 <tibble [1 x 1]>
Compare with:
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
group_by(cyl, carb) %>%
nest()
# A tibble: 9 x 3
# cyl carb data
# <dbl> <dbl> <list>
# 1 6 4 <tibble [4 x 1]>
# 2 4 1 <tibble [5 x 1]>
# 3 6 1 <tibble [2 x 1]>
# 4 8 2 <tibble [4 x 1]>
# 5 8 4 <tibble [6 x 1]>
# 6 4 2 <tibble [6 x 1]>
# 7 8 3 <tibble [3 x 1]>
# 8 6 6 <tibble [1 x 1]>
# 9 8 8 <tibble [1 x 1]>
Since my code used to work fine until last month, I wonder if tidyr was updated lately and the way of handling factor groups by nest() was changed?
Is it generally advisable not to nest data grouped by factor variables or rather not to group_by() on factor variables?
Edit:
In the issue mentioned by aosmith, Hadley refers to group_nest(), which seems to resolve the problem (caveat: this function reorders the tibble!). Nevertheless, I still wonder why nest() is producing NULLs...
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)) %>%
group_by(cyl, carb) %>%
group_nest() %>%
unnest %>%
all.equal(.,
mtcars %>%
as_tibble() %>%
select(cyl, carb, mpg) %>%
mutate(cyl = factor(cyl),
carb = factor(carb)))
# [1] TRUE
As aosmith suggested, this has recently been fixed in the dev version of tidyr. Since I didn't recognize that from the linked issue and couldn't manage to install the dev version, I submitted this question as another issue. Hadley just answered it.
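To check whether a given installation is affected, one could test the nested column for NULL elements. The sketch below uses group_nest() as in the edit above, assuming dplyr >= 0.8.0; `has_null` and `rows_per_group` are illustrative names.

```r
library(dplyr)

# Nest by two factor columns with group_nest().
nested <- mtcars %>%
  as_tibble() %>%
  select(cyl, carb, mpg) %>%
  mutate(cyl = factor(cyl), carb = factor(carb)) %>%
  group_nest(cyl, carb)

# TRUE for every list element that is NULL instead of a tibble.
has_null <- vapply(nested$data, is.null, logical(1))

# Rows stored in each nested tibble; the total should match the input.
rows_per_group <- vapply(nested$data, nrow, integer(1))

any(has_null)
sum(rows_per_group) == nrow(mtcars)
```

The same two checks can be applied to the output of nest() to see whether any groups were lost.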

Tidyverse and R: how to count rows in a tibble of a nested dataframe

So, I've checked multiple posts and haven't found anything. According to this, my code should work, but it doesn't.
Objective: I want to essentially print out the number of subjects--which in this case is also the number of rows in this tibble.
Code:
data<-read.csv("advanced_r_programming/data/MIE.csv")
make_LD<-function(x){
LongitudinalData<-x%>%
group_by(id)%>%
nest()
structure(list(LongitudinalData), class = "LongitudinalData")
}
print.LongitudinalData<-function(x){
paste("Longitudinal dataset with", x[["id"]], "subjects")
}
x<-make_LD(data)
print(x)
Here's the head of the dataset I'm working on:
> head(x)
[[1]]
# A tibble: 10 x 2
id data
<int> <list>
1 14 <tibble [11,945 x 4]>
2 20 <tibble [11,497 x 4]>
3 41 <tibble [11,636 x 4]>
4 44 <tibble [13,104 x 4]>
5 46 <tibble [13,812 x 4]>
6 54 <tibble [10,944 x 4]>
7 64 <tibble [11,367 x 4]>
8 74 <tibble [11,517 x 4]>
9 104 <tibble [11,232 x 4]>
10 106 <tibble [13,823 x 4]>
Output:
[1] "Longitudinal dataset with subjects"
I've tried every possible combination from the aforementioned stackoverflow post and none seem to work.
Here are two options:
library(tidyverse)
# Create a nested data frame
dat = mtcars %>%
group_by(cyl) %>%
nest() %>% as_tibble()  # as.tibble() is deprecated in favour of as_tibble()
cyl data
1 6 <tibble [7 x 10]>
2 4 <tibble [11 x 10]>
3 8 <tibble [14 x 10]>
dat %>%
mutate(nrow=map_dbl(data, nrow))
dat %>%
group_by(cyl) %>%
mutate(nrow = nrow(data.frame(data)))
cyl data nrow
1 6 <tibble [7 x 10]> 7
2 4 <tibble [11 x 10]> 11
3 8 <tibble [14 x 10]> 14
There is a specific function for this in the tidyverse: n()
You can simply do: mtcars %>% group_by(cyl) %>% summarise(rows = n())
> mtcars %>% group_by(cyl) %>% summarise(rows = n())
# A tibble: 3 x 2
cyl rows
<dbl> <int>
1 4 11
2 6 7
3 8 14
In more sophisticated cases, in which subjects may span across multiple rows ("long format data"), you can do (assuming hp denotes the subject):
> mtcars %>% group_by(cyl, hp) %>% #always group by subject-ID last
+ summarise(n = n()) %>% #observations per subject and cyl
+ summarise(n = n()) #subjects per cyl (implicitly summarises across all group-variables except the last)
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 2
cyl n
<dbl> <int>
1 4 10
2 6 4
3 8 9
Note that n in the last case is smaller than in the first because cars with the same cyl and hp values are now counted as a single "subject".
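Applied to the question's own class, the count can be taken from the nested tibble itself. In the original print method, x is an unnamed one-element list, so x[["id"]] is NULL and paste() silently drops it. The sketch below is one possible fix, not the only one; the `toy` data frame is a hypothetical stand-in for MIE.csv.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for MIE.csv: 3 subjects with a few rows each.
toy <- data.frame(id = rep(c(14L, 20L, 41L), times = c(2, 3, 4)),
                  value = seq_len(9))

make_LD <- function(x) {
  nested <- x %>% group_by(id) %>% nest() %>% ungroup()
  structure(list(nested), class = "LongitudinalData")
}

print.LongitudinalData <- function(x, ...) {
  # The nested tibble is the first (and only) element of the list,
  # and it has one row per subject.
  n_subjects <- nrow(x[[1]])
  cat("Longitudinal dataset with", n_subjects, "subjects\n")
  invisible(x)
}

ld <- make_LD(toy)
print(ld)
```

Using cat() rather than paste() makes the method print its message even when it is called inside another function.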

adding summarize output to original tibble

I would like to do something in between mutate and summarize.
I would like to calculate a summary statistic on groups, but retain the original data as a nested object. I assume this is a pretty generic task, but I can't figure out how to do it without invoking a join as well as grouping twice. Example code below:
mtcars %>%
group_by(cyl) %>%
nest() %>%
left_join(mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg)))
which produced desired output:
# A tibble: 3 x 3
cyl data mean_mpg
<dbl> <list> <dbl>
1 6 <tibble [7 x 10]> 19.74286
2 4 <tibble [11 x 10]> 26.66364
3 8 <tibble [14 x 10]> 15.10000
but I feel like this is not the "correct" way to do this.
Here is one way to do this without a join: use map_dbl (essentially a map whose result is a vector of type double) from the purrr package, a member of the tidyverse family, to calculate the mean of mpg nested in the data column:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(mean_mpg = map_dbl(data, ~ mean(.x$mpg)))
# A tibble: 3 x 3
# cyl data mean_mpg
# <dbl> <list> <dbl>
#1 6 <tibble [7 x 10]> 19.74286
#2 4 <tibble [11 x 10]> 26.66364
#3 8 <tibble [14 x 10]> 15.10000
Or you can calculate mean_mpg before nesting, and add mean_mpg as one of the group variables:
mtcars %>%
group_by(cyl) %>%
mutate(mean_mpg = mean(mpg)) %>%
group_by(mean_mpg, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions use add = TRUE
nest()
# A tibble: 3 x 3
# cyl mean_mpg data
# <dbl> <dbl> <list>
#1 6 19.74286 <tibble [7 x 10]>
#2 4 26.66364 <tibble [11 x 10]>
#3 8 15.10000 <tibble [14 x 10]>
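A third variant, assuming dplyr >= 1.0.0, is nest_by(), which returns a row-wise data frame so that `data` refers to one nested tibble at a time inside mutate(). This is a sketch rather than a recommendation over the two answers above; note that nest_by() sorts the result by cyl.

```r
library(dplyr)

# nest_by() nests and groups row-wise in a single step, so mean() is
# evaluated once per nested tibble.
res <- mtcars %>%
  nest_by(cyl) %>%
  mutate(mean_mpg = mean(data$mpg)) %>%
  ungroup()

res
```

The ungroup() at the end drops the row-wise grouping so that later verbs behave as they would on an ordinary tibble.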

extract model info from model saved as list column in r

I'm trying to extract model info from model in a list column.
Using mtcars to illustrate my problem:
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic=AIC(model))
What I got is this error message:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'logLik' applied to an object of class "list".
But when I do it this way, it works.
mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.)) %>%
mutate(aic=AIC(model))
Can anyone explain why the second way works? I could not figure it out. In both cases, the list column 'model' contains the model info, but there might be some differences... Thanks a lot.
Let's compare the differences between these two approaches. We can run your entire code except for the last AIC call and save the results to a and b.
a <- mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.)))
b <- mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.))
Now we can print the results in the console.
a
# A tibble: 3 x 3
cyl data model
<dbl> <list> <list>
1 6 <tibble [7 x 10]> <S3: lm>
2 4 <tibble [11 x 10]> <S3: lm>
3 8 <tibble [14 x 10]> <S3: lm>
b
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
cyl model
* <dbl> <list>
1 4 <S3: lm>
2 6 <S3: lm>
3 8 <S3: lm>
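Now we can see that data frame b is grouped by row, while data frame a is not. This is the key: without row-wise grouping, mutate() passes the whole model list column to AIC(), and there is no logLik method for a list.
To extract the AIC in data frame a, we can use the rowwise function so that mutate() evaluates AIC() on one model at a time.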
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
rowwise() %>%
mutate(aic=AIC(model))
Source: local data frame [3 x 4]
Groups: <by row>
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
Or we can use the map_dbl function because we know each AIC is numeric.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic = map_dbl(model, AIC))
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
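The difference between the two calls can be reproduced outside of any data frame. The sketch below uses only base R; `models`, `whole_list` and `aics` are illustrative names.

```r
# One lm per cyl group, stored in a plain named list.
models <- lapply(split(mtcars, mtcars$cyl),
                 function(d) lm(mpg ~ wt, data = d))

# Calling AIC() on the list itself fails, which is what happens when
# mutate() hands over the whole list column at once.
whole_list <- tryCatch(AIC(models), error = function(e) conditionMessage(e))

# Calling AIC() on each element works; vapply() plays the role of map_dbl().
aics <- vapply(models, AIC, numeric(1))

whole_list
aics
```

The values in `aics` are named by cyl, since split() names the list elements by group.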
