Join tibbles in list to one tibble [duplicate] - r

This question already has answers here: Simultaneously merge multiple data.frames in a list
I have a list of two data frames
a = list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17)
)
and add a new column to the data sets by
b = a %>%
map(~ mutate(.x, class = floor(runif(nrow(.x), 0, 2)))) %>%
map(~ nest(.x, -class))
Now I want to join the two list elements to one tibble based on class. Specifically, I am looking for a "smoother" solution than inner_join(pluck(b, 1), pluck(b, 2), "class") which gives the desired results but quickly gets messy if more data sets are involved in the list a.

This question is not super clear, but it seemed like there might be enough use cases to go for it. I added a few more data frames to a, constructed similarly, because the sample you used is too small to really see what you need to deal with.
library(tidyverse)
set.seed(123)
a <- list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17),
mtcars %>% as_tibble() %>% slice(1:10),
mtcars %>% as_tibble() %>% select(mpg, cyl, disp)
)
# same construction of b as in the question
You can use purrr::reduce to carry out the inner_join call repeatedly, returning a single data frame of nested data frames. That's straightforward enough, but I couldn't figure out a good way to supply the suffix argument to the join, which assigns .x and .y by default to differentiate between duplicate column names. So you get these weird names:
b %>%
reduce(inner_join, by = "class")
#> # A tibble: 2 x 5
#> class data.x data.y data.x.x data.y.y
#> <dbl> <list> <list> <list> <list>
#> 1 1 <tibble [11 × 10… <tibble [8 × 11… <tibble [3 × 11… <tibble [17 × …
#> 2 0 <tibble [21 × 10… <tibble [9 × 11… <tibble [7 × 11… <tibble [15 × …
You could probably deal with the names by creating something like data1, data2, etc before the reduce, but the quickest thing I decided on was replacing the suffixes with just the index of each data frame from the list b. A more complicated naming scheme would be a task for a different question.
b %>%
reduce(inner_join, by = "class") %>%
rename_at(vars(starts_with("data")),
str_replace, "(\\.\\w)+$", as.character(1:length(b))) %>%
names()
#> [1] "class" "data1" "data2" "data3" "data4"
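The other route mentioned above, naming the nested columns before the reduce, might look like the sketch below. The helper nest_with_index is a hypothetical name, and the alternating class column is a deterministic stand-in for the random one in the question, so this is an illustration under those assumptions rather than a drop-in replacement.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
a <- list(
  mtcars %>% as_tibble() %>% select(-vs),
  mtcars %>% as_tibble() %>% sample_n(17),
  mtcars %>% as_tibble() %>% slice(1:10)
)

# Hypothetical helper: add a class column, nest everything else, and
# name the nested column after the list position (data1, data2, ...).
nest_with_index <- function(df, i) {
  df %>%
    mutate(class = rep(0:1, length.out = n())) %>%  # stand-in for the random class
    nest(data = -class) %>%
    rename(!!paste0("data", i) := data)
}

# imap() passes the list position as the second argument for unnamed lists,
# so no column names collide and the join needs no suffixes.
joined <- a %>%
  imap(nest_with_index) %>%
  reduce(inner_join, by = "class")

names(joined)
```

Because every nested column already has a unique name, the regex clean-up after the join is no longer needed.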

Related

Unable to access nested data elements inside mutate

I am trying to understand why the following code doesn't work. My understanding is that it will take data$Sepal.Length (an element within the nested data column) and apply sum to that vector for each row.
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data$Sepal.Length, sum, na.rm = TRUE))
print(df)
But this throws the error "Total.Sepal.Length must be size 3 or 1, not 0". The following code, which accesses the column through an anonymous function as is usually done, works:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, function(x) sum(x$Sepal.Length, na.rm = TRUE)))
print(df)
I am trying to understand why the previous code didn't work even though I am correctly passing arguments to mutate and map.
You should do this:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, ~sum(.x$Sepal.Length, na.rm = TRUE)))
Two things. First, is there any reason you're not using group_by?
Second, your initial mutate is effectively trying to perform:
map_dbl(df$data$Sepal.Length, sum, na.rm = TRUE)
This yields an empty result, because df$data$Sepal.Length is NULL: data is a list of tibbles, so you have to access each list element to reach its columns (df$data[[1]]$Sepal.Length works). The resulting length-0 vector is what mutate rejects with the "must be size 3 or 1, not 0" error.
Output:
# A tibble: 3 × 3
Species data Total.Sepal.Length
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 250.
2 versicolor <tibble [50 × 4]> 297.
3 virginica <tibble [50 × 4]> 329.
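Both points can be checked with a short sketch. It uses nest(data = -Species), the current tidyr spelling of nest(-Species), and the object names nested and totals are only illustrative.

```r
library(dplyr)
library(tidyr)

nested <- iris %>% nest(data = -Species)

# The list column is a plain list of tibbles, so `$` on the list itself
# finds no element called Sepal.Length.
is.null(nested$data$Sepal.Length)

# Each element of the list is a tibble that does have the column.
head(nested$data[[1]]$Sepal.Length)

# If the nested tibbles are not needed afterwards, group_by() + summarise()
# gives the same totals without any list column.
totals <- iris %>%
  group_by(Species) %>%
  summarise(Total.Sepal.Length = sum(Sepal.Length, na.rm = TRUE))

totals
```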

pmap_df inside mutate applying to entire dataframe not just row

I have a dataframe where each row includes arguments that I want to pass into a function iteratively. The function itself returns a dataframe with a few rows. I would like to keep the arguments and results together in one dataframe by applying pmap_df like you can with pmap_dbl inside of a mutate to add a new column with the results from the function. With the code below, I am able to get a column with nested data in it, but every row contains the data for all of the results, not just the ones corresponding to that row.
library(dplyr)
library(tidyr)
example_function <- function(data, string, ...){
word_one <- paste(data$word_one, string)
word_two <- paste(data$word_two, string)
output <- data_frame(result_words = c(word_one, word_two))
}
fake_data <- tibble(group_id = rep(c(1, 2), each = 3),
word_one = c("hello", "goodbye", "today",
"apple", "banana", "coconut"),
word_two = c("my", "name", "is",
"ellie", "good", "morning"))
test <- fake_data %>%
group_by(group_id) %>%
nest() %>%
mutate(string = "not working") %>%
mutate(final_output = list(purrr::pmap_df(.l = ., .f = example_function)))
The output looks like:
Rows: 2
Columns: 4
Groups: group_id [2]
$ group_id <dbl> 1, 2
$ data <list> [<tbl_df[3 x 2]>], [<tbl_df[3 …
$ string <chr> "not working", "not working"
$ final_output <list> [<tbl_df[12 x 1]>], [<tbl_df[…
What I would like to have would be for each of the final outputs to have only 6 rows in each dataframe, corresponding to the inputs from the nested data column. Is this possible?
With the OP's function, this can be done without any pmap once the function returns its output explicitly:
example_function <- function(data, string, ...){
word_one <- paste(data$word_one, string)
word_two <- paste(data$word_two, string)
output <- tibble(result_words = c(word_one, word_two))  # tibble() replaces the deprecated data_frame()
output
}
Because nest_by() returns a row-wise data frame, data refers to a single nested tibble in each row, so the function can be applied directly:
library(dplyr)
fake_data %>%
nest_by(group_id) %>%
mutate(string = "not working") %>%
mutate(final_output = list(example_function(data, string)))
# A tibble: 2 × 4
# Rowwise: group_id
group_id data string final_output
<dbl> <list<tibble[,2]>> <chr> <list>
1 1 [3 × 2] not working <tibble [6 × 1]>
2 2 [3 × 2] not working <tibble [6 × 1]>
With pmap, capture the arguments for each row as a list in an object x1, then apply the OP's function to the list elements x1$data and x1$string:
library(purrr)
fake_data %>%
nest_by(group_id) %>%
mutate(string = "not working") %>%
ungroup %>%
mutate(final_output = pmap(across(-group_id),
~ {
x1 <- list(...)
example_function(x1$data, x1$string)
}))
# A tibble: 2 × 4
group_id data string final_output
<dbl> <list<tibble[,2]>> <chr> <list>
1 1 [3 × 2] not working <tibble [6 × 1]>
2 2 [3 × 2] not working <tibble [6 × 1]>
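As for why the original attempt returned 12 rows everywhere: inside the pipe, `.` stands for the entire data frame on the left-hand side, so pmap_df(.l = ., ...) ran over every row and bound all results together before list() wrapped them. Since only two columns feed the function, a plain map2 over those columns is another option. This is a sketch, with example_function rewritten to use tibble() and an arbitrary string value.

```r
library(dplyr)
library(tidyr)
library(purrr)

example_function <- function(data, string, ...) {
  tibble(result_words = c(paste(data$word_one, string),
                          paste(data$word_two, string)))
}

fake_data <- tibble(
  group_id = rep(c(1, 2), each = 3),
  word_one = c("hello", "goodbye", "today", "apple", "banana", "coconut"),
  word_two = c("my", "name", "is", "ellie", "good", "morning")
)

# map2() walks the data and string columns in parallel, so each call
# only sees the nested tibble and string belonging to one row.
test <- fake_data %>%
  nest(data = -group_id) %>%
  mutate(string = "now working",
         final_output = map2(data, string, example_function))

map_int(test$final_output, nrow)
```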

R purrr:map on a grouped/nested tibble

I would like to apply a function across columns of a nested grouped tibble as in the example below.
library(tidyverse)
df <- swiss %>%
group_by(Catholic > 20) %>%
nest()
Which results in a tibble that looks like:
> df
# A tibble: 2 x 2
# Groups: Catholic > 20 [2]
`Catholic > 20` data
<lgl> <list>
1 FALSE <tibble [26 × 6]>
2 TRUE <tibble [21 × 6]>
Now I make some function to build a model
fit <- function(df, modL = NA){
if (modL == 1) {fit <- lm(Fertility ~ Education, data = df)}
if (modL == 2) {fit <- lm(Fertility ~ Education + Examination, data = df)}
fit
}
Now I map that model to columns of the grouped data and make two new variables to store the model fits.
df <- df %>%
mutate(model1 = map(data, fit, modL = 1)) %>%
mutate(model2 = map(data, fit, modL = 2))
Which produces a tibble with two new columns that contain the model fits
> df
# A tibble: 2 x 4
# Groups: Catholic > 20 [2]
`Catholic > 20` data model1 model2
<lgl> <list> <list> <list>
1 FALSE <tibble [26 × 6]> <lm> <lm>
2 TRUE <tibble [21 × 6]> <lm> <lm>
What I want to achieve is a purrr-style map function that does the same thing as the following code.
anova(df$model1[[1]], df$model2[[1]])
anova(df$model1[[2]], df$model2[[2]])
I thought the following code would work, but it does not.
map(df[,3:4], anova)
Gurus, how do I map a function across columns of a nested and grouped dataset to give one result per row using the columns of that row as input?
Brant
df %>%
mutate(anova = map2(model1, model2, ~ anova(.x,.y)))%>%
mutate(pvalue = map_dbl(anova, ~.$`Pr(>F)`[2]))
I think this is what you want; please clarify if not. The second mutate pulls out the p-value of the anova for each pairwise comparison.
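For context, map(df[, 3:4], anova) iterates over the two columns, so anova is called once with the whole model1 list and once with the whole model2 list, never with a pair of models from the same row. A self-contained version of the map2 approach is sketched below; the column name mostly_catholic and the inline lm calls are assumptions made to keep the example short.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- swiss %>%
  mutate(mostly_catholic = Catholic > 20) %>%
  nest(data = -mostly_catholic) %>%
  mutate(
    model1 = map(data, ~ lm(Fertility ~ Education, data = .x)),
    model2 = map(data, ~ lm(Fertility ~ Education + Examination, data = .x)),
    # map2() pairs the i-th element of model1 with the i-th element of model2
    comparison = map2(model1, model2, ~ anova(.x, .y)),
    # the second row of the anova table holds the test of model2 against model1
    pvalue = map_dbl(comparison, ~ .x[["Pr(>F)"]][2])
  )

select(models, mostly_catholic, pvalue)
```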

rsample vfold_cv function not accepting .y parameter from purrr::map2

I'm trying to create nested cross-validations using the rsample package, and I use purrr::map2 to create them multiple times, with a differing number of folds as dictated by the v parameter. However, the vfold_cv function does not accept the v parameter, and instead I get this error: Error: v must be a single integer.
In the reprex below, I'm simulating the situation using the mtcars data, by creating a cross validation for each cylinder. Replacing .y with a number works, but I need the parameter to vary with each cylinder by using the n column.
library(dplyr)
library(purrr)
library(parsnip)
library(rsample)
library(tidyr)
data("mtcars")
nested <- mtcars %>%
select(cyl, disp:gear) %>%
group_by(cyl) %>%
nest(data = disp:gear) %>%
cbind(n = 2:4)
nested %>%
group_by(cyl) %>%
mutate(cv = map2(data, n,
~nested_cv(.x,
inside = vfold_cv(v = 10, repeats = 3),
outside = vfold_cv(v = .y))))
Error: `v` must be a single integer.
The problem is the vfold_cv() call inside nested_cv(). You can reproduce it with a plain function:
createNested = function(x,y){
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = vfold_cv(v = y))
}
createNested(nested$data[[1]],3)
Error in vfold_splits(data = data, v = v, strata = strata, breaks = breaks) :
object 'y' not found
So the vfold_cv() call cannot see the y variable (like your .y) from inside the function. To work around this, I wrote a function that builds the outside resamples with vfold_cv() first and passes the result explicitly into nested_cv(). It takes a few more lines of code, but it works:
createNested = function(x,y){
outside_cv = vfold_cv(x,v = y)
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = outside_cv)
}
nested <- mtcars %>%
select(cyl, disp:gear) %>%
nest(data = disp:gear) %>%
mutate(n=2:4)
nested %>% mutate(cv = map2(data,n,.f=createNested))
# A tibble: 3 x 4
cyl data n cv
<dbl> <list> <int> <list>
1 6 <tibble [7 × 8]> 2 <tibble [2 × 3]>
2 4 <tibble [11 × 8]> 3 <tibble [3 × 3]>
3 8 <tibble [14 × 8]> 4 <tibble [4 × 3]>
Note: once you have nested the data, you don't need group_by().
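How an "object 'y' not found" error of this kind can arise is easier to see in plain R. The sketch below is not rsample's implementation; capture_and_eval, fails and works are hypothetical names, and the helper only imitates a function that captures a call and evaluates it somewhere the caller's local variables are not visible.

```r
# Hypothetical helper: if the argument is written as a call, evaluate that call
# in the base environment instead of the caller's frame; otherwise use the value.
capture_and_eval <- function(expr) {
  captured <- substitute(expr)
  if (is.call(captured)) {
    eval(captured, envir = baseenv())  # the caller's local `y` is not visible here
  } else {
    expr                               # a plain symbol is evaluated normally
  }
}

# Passing a call that mentions a local variable fails ...
fails <- function(y) capture_and_eval(y + 1)

# ... while computing the value first and passing the result works.
works <- function(y) {
  value <- y + 1
  capture_and_eval(value)
}

msg <- tryCatch(fails(3), error = function(e) conditionMessage(e))
msg
works(3)
```

The workaround in the answer follows the second pattern: the outside resamples are built first and handed over as a finished object.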

purrr::map_df with nested data.frame

I'd like to iterate over a series of dataframes and apply the same function to them all.
I'm trying this using tidyr::nest and purrr::map_df. Here's a reprex of the sort of thing I'm trying to achieve.
data(iris)
library(dplyr)
library(purrr)
library(tidyr)
iris_df <- as.data.frame(iris)
my_var <- 2
my_fun <- function(df) {
sum_df <- sum(df) + my_var
}
iris_df %>% group_by(Species) %>% nest() %>% map_df(.$data, my_fun)
# Error: Index 1 must have length 1
What am I doing wrong? Is there a different approach?
EDIT:
To clarify my desired output: I'm aiming for a new column containing the function output, e.g.
|Species|Data|my_function_output|
|:------|:---|:-----------------|
|setosa |<tibble>|509.1 |
The problem is that nest() gives you a data.frame with a column data, which is a list of data.frames. You need to map or sapply over the data column of the nest() output, not the entire nest() output. I use sapply, but you could also use map_dbl. If you use map you will end up with list output, and map_df will not work here because it row-binds the results, which must therefore be data frames or named vectors rather than bare numbers.
iris_df %>%
group_by(Species) %>%
nest() %>%
mutate(my_fun_out = sapply(data, my_fun))
# A tibble: 3 x 3
Species data my_fun_out
<fct> <list> <dbl>
1 setosa <tibble [50 x 4]> 509
2 versicolor <tibble [50 x 4]> 717
3 virginica <tibble [50 x 4]> 859
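The map_dbl variant mentioned above would look roughly like this. It is a sketch that uses nest(data = -Species) in place of group_by() plus nest(), and it returns the sum explicitly from my_fun.

```r
library(dplyr)
library(tidyr)
library(purrr)

my_var <- 2
my_fun <- function(df) {
  sum(df) + my_var  # sum over all numeric columns of one nested tibble
}

# map_dbl() returns a plain double vector, so the new column is numeric
# rather than a list column.
out <- iris %>%
  nest(data = -Species) %>%
  mutate(my_fun_out = map_dbl(data, my_fun))

out
```

Compared with sapply, map_dbl fails loudly if my_fun ever returns something other than a single number, which can make mistakes easier to spot.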
