add grouping variable for nested tibbles - r

This is a follow-up to this question.
I need to be able to group_by() columns in my new nested table. I can't find a purrr function that does this (although I know a solution exists). I need to group_by in each table to apply additional summarizing functions and fit linear models appropriately. The example here is just a dummy example.
library(tidyverse)
set.seed(2)
N <- 30
df <- tibble(type = rep(c("small","medium","high"), each=N/3),
dummy = rep(c(1,5,10),each=10),
xvals = rep(1:10,3),
A = rnorm(N)*dummy,
B = rnorm(N)*dummy,
C = rnorm(N)*dummy) %>%
mutate(type = factor(type, levels=c("small","medium","high"))) %>%
select(-dummy) %>%
pivot_longer(cols=-c(type,xvals), names_to="metric", values_to = "value") %>%
group_by(type) %>%
group_nest(.key="data")
This produces a tibble with two columns:
df
# A tibble: 3 x 2
type data
<fct> <list>
1 small <tibble [30 x 3]>
2 medium <tibble [30 x 3]>
3 high <tibble [30 x 3]>
This is an example of what I want to do across all the nested tibbles:
df[[2]][[1]] %>%
group_by(metric) %>%
summarize(mean = mean(value))
# A tibble: 3 x 2
metric mean
<chr> <dbl>
1 A 0.211
2 B -0.296
3 C -0.391

After group_nest(), 'data' is a list column of tibbles, and the result has only two columns, 'type' and 'data'. To group within each nested tibble, loop over the list column with map() and call group_by() inside it:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(data = map(data, ~ .x %>%
group_by(metric) %>%
summarize(mean = mean(value)))) -> out
out$data[[1]]
# A tibble: 3 x 2
# metric mean
# <chr> <dbl>
#1 A 0.115
#2 B 0.323
#3 C -0.326
NOTE: The output values differ from those in the question because the seed was not set for this run.
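A minimal sketch of the same pattern on a small hand-made tibble, so the result can be checked by eye. The `toy` data and its values are made up for illustration and are not taken from the question. The second pipeline shows that, when only a summary is needed, grouping by both columns without nesting gives the same numbers.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two types, two metrics, two values each
toy <- tibble(
  type   = rep(c("small", "high"), each = 4),
  metric = rep(c("A", "A", "B", "B"), times = 2),
  value  = c(1, 3, 10, 20, 2, 4, 30, 50)
)

# Nest by type, then group and summarise inside each nested tibble
nested_means <- toy %>%
  group_nest(type, .key = "data") %>%
  mutate(data = map(data, ~ .x %>%
    group_by(metric) %>%
    summarise(mean = mean(value), .groups = "drop"))) %>%
  unnest(data)

# Same summary without nesting: group by both columns at once
flat_means <- toy %>%
  group_by(type, metric) %>%
  summarise(mean = mean(value), .groups = "drop")

nested_means
```

The nested form is mainly worth it when each group needs something that does not reduce to one row per group, such as a fitted model object.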

Related

How can I wrangle the data frame based on the parameters inside a nested column?

So let's say we have this df:
a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
df = df %>%
nest(data=c(a,b))
I want to use parameters from inside the nested "data" column to do things to the entire dataframe, for example use filter() to eliminate rows where the sum of "a" inside the nested "data" is equal to 0. Or to arrange the rows of the dataframe by the max() of b. How can I do this?
I came up with a pretty dumb way of doing this, but I am not happy with it, as it isn't really applicable to the larger datasets I'm working with:
sum_column = function(df){
df = df %>%
summarize(value=sum(a))
return(df[[1]][1])
}
# so make a new column with the sum of a, and THEN filter by that
df = df %>%
mutate(sum_of_a = map(data, ~sum_column(.x))) %>%
filter(!sum_of_a==0)
map returns a list, perhaps you want map_dbl?
library(dplyr)
library(purrr)
df %>%
mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>%
filter(!sum_of_a == 0)
# # A tibble: 2 × 3
# c data sum_of_a
# <chr> <list> <dbl>
# 1 w <tibble [5 × 2]> 5
# 2 y <tibble [5 × 2]> 5
or more directly (in case you no longer need sum_of_a):
df %>%
filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
# c data
# <chr> <list>
# 1 w <tibble [5 × 2]>
# 2 y <tibble [5 × 2]>
(The only reason I changed from ! . == 0 to abs(.) > 0 is due to floating-point tests of equality, not wanting to assume the precision and scale of numbers you're actually using. C.f., Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
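The question also asked about arranging rows by the max() of b. A sketch along the same lines, assuming the data from the question rebuilt as a tibble; the name `by_max_b` is only for illustration:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same data as in the question, built as a tibble and nested by c
df <- tibble(
  a = rep(c(1, 0, 1, 0), each = 5),
  b = rep(c(4, 3, 2, 1), each = 5),
  c = rep(c("w", "x", "y", "z"), each = 5)
) %>%
  nest(data = c(a, b))

# Order the outer rows by the largest b inside each nested tibble
by_max_b <- df %>%
  arrange(map_dbl(data, ~ max(.x$b)))

by_max_b$c
```

Wrapping the map_dbl() call in desc() would sort from largest to smallest instead.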

Create column of data frames based on function

I would like to use the map function with the tidyverse to create a column of data frames based on arguments from some, but not all, of the columns of the original data frame/tibble.
I would prefer to be able to use the map function so that I can replace this with future_map to utilize parallel computing.
With the exception of this solution not using map, this solution produces the correct end result (see also this question and answer: How to use rowwise to create a list column based on a function):
library(tidyverse)
library(purrr)
df <- data.frame(a= c(1,2,3), b=c(2,3,4), c=c(6,5,8))
fun <- function(q,y) {
r <- data.frame(col1 = c(q+y, q, q, y), col2 = c(q,q,q,y))
r
}
result1 <- df %>% rowwise(a) %>% mutate(list1 = list(fun(a, b)))
> result1
# A tibble: 3 × 4
# Rowwise: a
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
How can I instead do this with map? Here are three incorrect attempts:
Incorrect attempt 1:
wrong1 <- df %>% mutate(list1 = map(list(a,b), fun))
Incorrect attempt 2:
wrong2 <- df %>% mutate(list1 = map(c(a,b), fun))
Incorrect attempt 3:
wrong3 <- df %>% mutate(list1 = list(map(list(a,b), fun)))
The error I get is: argument "y" is missing, with no default. I am not sure how to pass multiple arguments in a situation like this.
I would like a solution with multiple arguments, but if that is not possible, let's move to a function with one argument.
fun_one_arg <- function(q) {
r <- data.frame(col1 = c(q, q, q, q+q), col2 = c(3*q,q,q,q/2))
r
}
wrong4 <- df %>% mutate(list1 = map(a, fun_one_arg))
wrong5 <- df %>% mutate(list1 = list(map(a, fun_one_arg)))
These run, but the fourth columns are not data frames, as I would have expected.
We can use map2 as there are two arguments
library(dplyr)
df %>%
mutate(list1 = map2(a, b, fun)) %>%
as_tibble
# A tibble: 3 x 4
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
Another option is pmap, which can also take more than two columns. Here ..1 and ..2 refer to the columns in the order they are passed:
df %>%
mutate(list1 = pmap(across(c(a, b)), ~ fun(..1, ..2))) %>%
as_tibble
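A variant that may read more clearly: pmap matches the names of a list to the function's argument names, so the columns can be passed as list(q = a, y = b) without the ..1, ..2 placeholders. This is a sketch using the `fun` and `df` from the question; `with_map2` and `with_pmap` are names chosen here for illustration.

```r
library(dplyr)
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(6, 5, 8))

fun <- function(q, y) {
  data.frame(col1 = c(q + y, q, q, y), col2 = c(q, q, q, y))
}

# map2 walks a and b in parallel, one pair per row
with_map2 <- df %>% mutate(list1 = map2(a, b, fun))

# pmap matches the list names to fun's argument names q and y
with_pmap <- df %>% mutate(list1 = pmap(list(q = a, y = b), fun))

with_pmap$list1[[2]]
```

As far as I know, furrr's future_map2() and future_pmap() take their arguments in the same shape, so either form should carry over to the parallel version.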

map_dbl with list column gives Error: Result 1 must be a single double, not a double vector of length

diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
nest() %>%
mutate(scale_log = map_dbl(data, ~.x$log_price %>% scale %>% as.vector))
Desired outcome is a new variable scale_log that is just a vector for each row of the data frame. The code above gives:
Error: Result 1 must be a single double, not a double vector of length 1610
It works if I just use regular map:
x <- diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
nest() %>%
mutate(scale_log = map(data, ~.x$log_price %>% scale %>% as.vector))
But when I do this x$scale_log is a list whereas I wanted it to be just a vector:
x %>% glimpse
Rows: 5
Columns: 3
Groups: cut [5]
$ cut <ord> Ideal, Premium, Good, Very Good, Fair
$ data <list> [<tbl_df[21551 x 11]>, <tbl_df[13791 x 11]>, <tbl_df[4906 x 11]>, <tbl_df[12082 x 11]>, <tbl_df[1610 x 11]>]
$ scale_log <list> [<-1.8668994653, -1.8245259872, -1.8127394789, -1.8010892338, -1.6532201679, -1.6532201679, -1.6532201679, -1.6507226822, -1.6507226822, -1…
How can I use map_dbl to get my desired outcome?
Don't nest or use any version of map; a group_by + mutate combination is enough.
library(dplyr)
diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
mutate(scale_log = as.numeric(scale(log_price)))
The issue with the map_dbl approach is:
library(tidyr)
library(purrr)
diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
nest()
# cut data
# <ord> <list>
#1 Ideal <tibble [21,551 × 10]>
#2 Premium <tibble [13,791 × 10]>
#3 Good <tibble [4,906 × 10]>
#4 Very Good <tibble [12,082 × 10]>
#5 Fair <tibble [1,610 × 10]>
This is a 5-row data frame. map_dbl requires each call to return a single double, but here each call returns one scaled value per row of the nested tibble (53940 values in total, i.e. nrow(diamonds)). Those values cannot fit into a 5-row column, hence the error.
You can solve this by keeping the data in a list using map and then unnest.
diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
nest() %>%
summarise(scale_log = map(data, ~.x$log_price %>% scale %>% as.vector)) %>%
unnest(scale_log)
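To see the difference on something small, here is a sketch with a made-up six-row tibble (`toy`, `g` and `x` are illustrative names, not from the question). map_dbl is fine when each nested tibble reduces to one number, fails when it returns one number per row, and group_by + mutate handles the per-row case directly.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two groups of three values
toy <- tibble(g = c("a", "a", "a", "b", "b", "b"),
              x = c(1, 2, 3, 10, 20, 30))

nested <- toy %>% nest(data = x)

# One number per nested tibble: map_dbl works
means <- nested %>% mutate(m = map_dbl(data, ~ mean(.x$x)))

# Three numbers per nested tibble: map_dbl raises an error
attempt <- tryCatch(
  nested %>% mutate(s = map_dbl(data, ~ as.vector(scale(.x$x)))),
  error = function(e) "error"
)

# One scaled value per original row: no nesting needed
scaled <- toy %>%
  group_by(g) %>%
  mutate(s = as.numeric(scale(x))) %>%
  ungroup()

scaled$s
```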

How to calculate the sum of rows by each gvkey respectively?

I tried to calculate the cumulative sum of the Twitter followers for each gvkey separately, and I used the group_by function, but the output is still the sum of the entire column. I suppose the problem is the "for (i in 1:nrow(predmod_e))" loop.
predmod_e <- predmod_e %>%
arrange(gvkey, date) %>% # arrange by gvkey and date
group_by(gvkey) # use group_by for per-group calculation
for (i in 1:nrow(predmod_e)) {
predmod_e[i+1,]$x <- predmod_e[i+1,]$x + predmod_e[i,]$x
} # for loop to calculate
Perhaps just this:
predmod_e <- predmod_e %>%
arrange(gvkey, date) %>%
group_by(gvkey) %>%
mutate(newx = cumsum(x))
If you want to do something with the groups yourself (i.e., not with a dplyr verb), then you should use the groups as they are "known" by the tidy verbs. Luckily, they are merely stored as an attribute:
mtcars %>%
group_by(cyl) %>%
attr(., "groups")
# # A tibble: 3 x 2
# cyl .rows
# <dbl> <list>
# 1 4 <int [11]>
# 2 6 <int [7]>
# 3 8 <int [14]>
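The loop in the question ignores the grouping because row indexing with [i, ] does not look at groups, so it runs straight across group boundaries. A small sketch with made-up data (the `toy` values are hypothetical) shows cumsum() restarting per gvkey, and group_rows() as an accessor for the row indices held in the "groups" attribute:

```r
library(dplyr)

# Hypothetical toy data: two firms, two dates each
toy <- tibble(
  gvkey = c(1, 1, 2, 2),
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02")),
  x     = c(1, 2, 10, 20)
)

grouped <- toy %>%
  arrange(gvkey, date) %>%
  group_by(gvkey)

# cumsum() restarts at every gvkey because mutate() respects the grouping
running <- grouped %>%
  mutate(newx = cumsum(x)) %>%
  ungroup()

# Row indices per group, as dplyr stores them
rows_per_group <- group_rows(grouped)

running$newx
```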

calculate multiple t-tests for all the combinations of 2 list elements in R tidyverse environment

I am looking for a way to calculate multiple t-tests for all the combinations of 2 list elements in R tidyverse environment.
I would like to test the means of Miles/(US) gallon based on Transmission for each combination of cyl and vs. My working example is this code:
mtcars %>%
filter(cyl==8 & vs == 0) %>%
mutate(am = as.factor(am)) %>%
# independent t-test
t.test(mpg ~ am, data = ., paired = FALSE)%>%
broom::tidy() %>%
mutate(cyl = 8) %>%
mutate(vs = 0) %>%
select(cyl, vs, everything())
I wrote this piece of code:
cyl_list <- list(unique(mtcars$cyl)) # 6 4 8
vs_list <- list(unique(mtcars$vs)) # 0 1
complete_t_test <- function(cyl_par, vs_par){
mtcars %>%
filter(cyl=={cyl_par} & vs == {vs_par}) %>%
mutate(am = as.factor(am)) %>%
# independent t-test
t.test(mpg ~ am, data = ., paired = FALSE) %>%
broom::tidy() %>%
mutate(cyl = {cyl_par}) %>%
mutate(vs = {vs_par}) %>%
select(cyl, vs, everything())}
I was thinking of something similar to purrr::map2(cyl_list, vs_list, complete_t_test)
but it did not work.
List columns may be a viable solution (see book R for Data Science, chapter 25). I create a list column using nest(), then do the t-tests, and unnest() again to see the results.
NB: Your example fails for several combinations in the mtcars data, and therefore I use possibly() to do the t-test only if appropriate data are available.
library("tidyverse")
f1 <- possibly(~t.test(mpg ~ am, data = .x), otherwise = NULL)
mtcars %>%
group_by(cyl, vs) %>%
nest() %>% # create list columns
mutate(res = map(data, ~f1(.x))) %>% # do t-tests
mutate(res = map(res, broom::tidy)) %>% # tidy()
unnest(res) %>% # unnest list columns
select(1:8) # show some columns for stackoverflow
#> # A tibble: 2 x 8
#> # Groups: cyl, vs [2]
#> cyl vs data estimate estimate1 estimate2 statistic p.value
#> <dbl> <dbl> <list<df[,9]>> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 [10 x 9] -5.47 22.9 28.4 -2.76 0.0254
#> 2 8 0 [14 x 9] -0.350 15.0 15.4 -0.391 0.704
Created on 2019-11-04 by the reprex package (v0.3.0)
Write a function that runs t.test for a single combination. Use cross_df to create all combinations and apply the function complete_t_test to each one.
library(tidyverse)
complete_t_test <- function(cyl_par, vs_par) {
tryCatch({
mtcars %>%
filter(cyl== cyl_par & vs == vs_par) %>%
t.test(mpg ~ am, data = ., paired = FALSE) %>%
broom::tidy()
}, error = function(e) return(NA))
}
cyl_list <- unique(mtcars$cyl)
vs_list <- unique(mtcars$vs)
cross_df(list(a = cyl_list, b = vs_list)) %>%
mutate(t_test = map2(a, b, complete_t_test))
# a b t_test
# <dbl> <dbl> <list>
#1 6 0 <lgl [1]>
#2 4 0 <lgl [1]>
#3 8 0 <tibble [1 × 10]>
#4 6 1 <lgl [1]>
#5 4 1 <tibble [1 × 10]>
#6 8 1 <lgl [1]>
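Note that cross_df() is deprecated as of purrr 1.0.0; tidyr::expand_grid() builds the same set of combinations. A sketch of the same idea with expand_grid() and possibly(), keeping only the p-value so that broom is not needed. The helper name `safe_p` is made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Returns the t-test p-value for one cyl/vs combination,
# or NA when the subset cannot be tested (e.g. only one am level)
safe_p <- possibly(function(cyl_par, vs_par) {
  d <- dplyr::filter(mtcars, cyl == cyl_par, vs == vs_par)
  t.test(mpg ~ am, data = d)$p.value
}, otherwise = NA_real_)

res <- expand_grid(cyl = unique(mtcars$cyl), vs = unique(mtcars$vs)) %>%
  mutate(p_value = map2_dbl(cyl, vs, safe_p))

# Keep only the combinations where the test could be run
res %>% filter(!is.na(p_value))
```

Returning NA_real_ rather than NA or NULL keeps the result a plain numeric column, so map2_dbl() can be used and the failed combinations are easy to filter out.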
