purrr::map_df with nested data.frame - r

I'd like to iterate over a series of dataframes and apply the same function to them all.
I'm trying this using tidyr::nest and purrr::map_df. Here's a reprex of the sort of thing I'm trying to achieve.
data(iris)
library(purrr)
library(tidyr)
iris_df <- as.data.frame(iris)
my_var <- 2
my_fun <- function(df) {
sum_df <- sum(df) + my_var
}
iris_df %>% group_by(Species) %>% nest() %>% map_df(.$data, my_fun)
# Error: Index 1 must have length 1
What am I doing wrong? Is there a different approach?
EDIT:
To clarify my desired output: I'm aiming for a new column containing the function output, e.g.
|Species|Data|my_function_output|
|:------|:---|:-----------------|
|setosa |<tibble>|509.1 |

The problem is that nest() gives you a data frame with a column data, which is a list of data frames. You need to map or sapply over the data column of the nest() output, not over the entire nest() output. I use sapply, but you could also use map_dbl. If you use map you will end up with list output, and map_df will not work here because it row-binds the results into a data frame, which requires named elements rather than bare numbers.
library(dplyr) # for group_by() and mutate()
iris_df %>%
  group_by(Species) %>%
  nest() %>%
  mutate(my_fun_out = sapply(data, my_fun))
# A tibble: 3 x 3
Species data my_fun_out
<fct> <list> <dbl>
1 setosa <tibble [50 x 4]> 509
2 versicolor <tibble [50 x 4]> 717
3 virginica <tibble [50 x 4]> 859
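For comparison, here is a minimal sketch of the map_dbl variant mentioned above. It assumes dplyr, tidyr and purrr are installed, and it rewrites my_fun to return its value directly; the name res and the ungroup() call are illustrative additions, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

my_var <- 2

# Sum every cell of the (all-numeric) nested data frame and add my_var
my_fun <- function(df) sum(df) + my_var

res <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  # map_dbl() returns a plain double vector, one value per nested tibble
  mutate(my_fun_out = map_dbl(data, my_fun))

res
```

Unlike sapply, map_dbl fails loudly if my_fun ever returns something other than a single number, which can make mistakes easier to spot.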

Related

Unable to access nested data elements inside mutate

I am trying to understand why the following code doesn't work. My understanding is that it will take data$Sepal.Length (an element within the nested data column) and apply the function sum to that vector for each group.
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data$Sepal.Length, sum, na.rm = TRUE))
print(df)
But this throws the error "Total.Sepal.Length must be size 3 or 1, not 0". The following code works by using an anonymous function, which is how the nested column is usually accessed:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, function(x) sum(x$Sepal.Length, na.rm = TRUE)))
print(df)
I am trying to understand why the previous code didn't work even though I am correctly passing arguments to mutate and map.
You should do this:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, ~sum(.x$Sepal.Length, na.rm = TRUE)))
Two things: any reason you're not using group_by?
Second: your initial mutate is trying to perform:
map_dbl(df$data$Sepal.Length, sum, na.rm = TRUE)
This returns an empty result because df$data$Sepal.Length is NULL: data is a list of tibbles, so you have to access each list element before you can access its columns (df$data[[1]]$Sepal.Length works).
Output:
# A tibble: 3 × 3
Species data Total.Sepal.Length
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 250.
2 versicolor <tibble [50 × 4]> 297.
3 virginica <tibble [50 × 4]> 329.
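The NULL result can be checked directly. The sketch below assumes dplyr and tidyr are installed; the object name nested is only illustrative.

```r
library(dplyr)
library(tidyr)

nested <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup()

# `data` is a list of tibbles, so `$` looks for a list element named
# "Sepal.Length" and finds none
is.null(nested$data$Sepal.Length)

# Each list element is a tibble that does have the column
length(nested$data[[1]]$Sepal.Length)
```

This is why the function passed to map_dbl has to do the column lookup itself: map_dbl hands it one tibble at a time.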

run model for each line of model parameters (meta) data.frame

In the spirit of purrr, broom, and modelr, I am trying to create a "meta" data.frame in which each row denotes the dataset (d) and the model parameters (yvar, xvars, FEvars). For instance:
iris2 <- iris %>% mutate(Sepal.Length=Sepal.Length^2)
meta <- data.frame(n=1:4,
yvar = c('Sepal.Length','Sepal.Length','Sepal.Length','Sepal.Length'),
xvars= I(list(c('Sepal.Width'),
c('Sepal.Width','Petal.Length'),
c('Sepal.Width'),
c('Sepal.Width','Petal.Length'))),
data= I(list(iris,iris,iris2,iris2)) )
Now, I would like to run a model for each row of "meta", and then add a list column "model" with the model output object. To run the model I use an auxiliary function that takes a dataset, a y variable and a vector of x variables:
OLS_help <- function(d,y,xvars){
paste(y, paste(xvars, collapse=" + "), sep=" ~ ") %>% as.formula %>%
lm(d)
}
y <- 'Sepal.Length'
xvars <- c('Sepal.Width','Petal.Length')
OLS_help(iris,y,xvars)
How can I execute OLS_help for all the rows of meta and add the output of OLS_help as a list column in meta? I tried the following code, but it did not work:
meta %>% mutate(model = map2(d,yvar,xvars,OLS_help) )
Error: Can't convert a `AsIs` object to function
Call `rlang::last_error()` to see a backtrace
OBS: The solution for the case where there is only the nested "data" list column (covered in Hadley's book) is:
by_country <- gapminder %>% group_by(country, continent) %>% nest()
country_model <- function(df) { lm(lifeExp ~ year, data = df) }
by_country <- by_country %>% mutate(model = map(data, country_model))
We can use pmap in the following way
df <- meta %>%
as_tibble() %>%
mutate_if(is.factor, as.character) %>%
mutate(fit = pmap(
list(yvar, xvars, data),
function(y, x, df) lm(reformulate(x, response = y), data = df)))
## A tibble: 4 x 5
# n yvar xvars data fit
# <int> <chr> <I<list>> <I<list>> <list>
#1 1 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#2 2 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
#3 3 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#4 4 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
Explanation: pmap iterates over multiple arguments simultaneously (similar to base R's Map); here we simultaneously loop through the entries in columns yvar, xvars and data, then use reformulate to construct the formula to be used within lm. We store the lm fit object in column fit.
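To see what reformulate contributes on its own, here is a small base R sketch using the same variable names as the question; the objects f and fit are illustrative.

```r
# Build a formula from character vectors of term labels and a response
f <- reformulate(c("Sepal.Width", "Petal.Length"), response = "Sepal.Length")
f

# The result is an ordinary formula object that lm() accepts
fit <- lm(f, data = iris)
coef(fit)
```

This does the same job as the paste() and as.formula() steps in OLS_help, without the manual string handling.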

Join tibbles in list to one tibble [duplicate]

I have a list of two data frames
a = list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17)
)
and add a new column to the data sets by
b = a %>%
map(~ mutate(.x, class = floor(runif(nrow(.x), 0, 2)))) %>%
map(~ nest(.x, -class))
Now I want to join the two list elements to one tibble based on class. Specifically, I am looking for a "smoother" solution than inner_join(pluck(b, 1), pluck(b, 2), "class") which gives the desired results but quickly gets messy if more data sets are involved in the list a.
This question is not super clear, but it seemed like there might be enough use cases to go for it. I added a few more data frames to a, constructed similarly, because the sample you used is too small to really see what you need to deal with.
library(tidyverse)
set.seed(123)
a <- list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17),
mtcars %>% as_tibble() %>% slice(1:10),
mtcars %>% as_tibble() %>% select(mpg, cyl, disp)
)
# same construction of b as in the question
You can use purrr::reduce to carry out the inner_join call repeatedly, returning a single data frame of nested data frames. That's straightforward enough, but I couldn't figure out a good way to supply the suffix argument to the join, which assigns .x and .y by default to differentiate between duplicate column names. So you get these weird names:
b %>%
reduce(inner_join, by = "class")
#> # A tibble: 2 x 5
#> class data.x data.y data.x.x data.y.y
#> <dbl> <list> <list> <list> <list>
#> 1 1 <tibble [11 × 10… <tibble [8 × 11… <tibble [3 × 11… <tibble [17 × …
#> 2 0 <tibble [21 × 10… <tibble [9 × 11… <tibble [7 × 11… <tibble [15 × …
You could probably deal with the names by creating something like data1, data2, etc before the reduce, but the quickest thing I decided on was replacing the suffixes with just the index of each data frame from the list b. A more complicated naming scheme would be a task for a different question.
b %>%
reduce(inner_join, by = "class") %>%
rename_at(vars(starts_with("data")),
str_replace, "(\\.\\w)+$", as.character(1:length(b))) %>%
names()
#> [1] "class" "data1" "data2" "data3" "data4"
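The other route mentioned above, naming the list columns data1, data2, etc. before the reduce, could look roughly like this. It is only a sketch: b_demo is a hypothetical stand-in for b with a fixed class key (the real b depends on random draws), and the renaming uses base R so that no particular dplyr renaming idiom is assumed.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for `b`: a `class` key plus a list column `data`
b_demo <- list(
  tibble(class = c(0, 1), data = list(head(mtcars, 2), head(mtcars, 3))),
  tibble(class = c(0, 1), data = list(head(iris, 2), head(iris, 3))),
  tibble(class = c(0, 1), data = list(head(cars, 2), head(cars, 3)))
)

joined <- b_demo %>%
  # imap() passes the element and its position; rename `data` to data<i>
  imap(function(df, i) {
    names(df)[names(df) == "data"] <- paste0("data", i)
    df
  }) %>%
  reduce(inner_join, by = "class")

names(joined)
```

Because the column names are already unique when the joins run, no suffixes are added and no regular expression clean-up is needed afterwards.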

How to write to disk data from each element of nested data frame in R?

My question is directly related to this one: In R, write each nested data frame to a CSV, but I am not able to get the solution to work and would like to avoid needing to install the extra required package purrrlyr.
I need to write each element (data.frame) of a nested data.frame to a table, with the name of each element corresponding to the first column of the nested data.frame:
ir <- iris %>% group_by(Species) %>% nest()
ir$Species <- as.character(ir$Species)
# A tibble: 3 x 2
Species data
<chr> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
I tried the linked solution:
temp <- ir %>% purrrlyr::by_row(~write.csv(.$data, file = .$Species))
But receive the following error:
Error in by_row(., ~write.csv(.$data, file = .$Species)) :
STRING_PTR() can only be applied to a 'character', not a 'list'
I have read about purrr::walk but I can't seem to figure out how to implement it.
We could use map2
library(purrr)
map2(ir$data, ir$Species, ~ write.csv(.x, file = paste0(.y, ".csv")))
If we don't want the NULL output message on console, use iwalk
iwalk(setNames(ir$data, ir$Species), ~ write.csv(.x, file = paste0(.y, ".csv")))
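Since the question asks about purrr::walk specifically, here is a hedged sketch with walk2, which is called only for its side effects. It writes into tempdir() so nothing lands in the working directory; out_dir and paths are illustrative names, and row.names = FALSE is an optional extra.

```r
library(dplyr)
library(tidyr)
library(purrr)

ir <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(Species = as.character(Species))

# Build one target path per nested data frame
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(ir$Species, ".csv"))

# walk2() iterates over both vectors in parallel and returns its input invisibly
walk2(ir$data, paths, ~ write.csv(.x, file = .y, row.names = FALSE))

file.exists(paths)
```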

List Columns - Creating a data frame of data frames

I'd like to create a pretty simple data frame of data frames. I'd like the master data frame to have 100 rows and three columns: one column called "row" with the numbers 1-100, and two other columns called "df1" and "df2" that each hold a data frame with one column "row" and the numbers 1-100. I've tried the following:
mydf <- data.frame(row=1:100)
for(i in 1:100){
mydf$df1[i] <- data.frame(row=1:100)
mydf$df2[i] <- data.frame(row=1:100)
}
But that creates lists not data frames and the columns are unnamed. I also tried:
mydf <- data.frame(row=1:100)
mydf <- mydf %>% mutate(df1=data.frame(row=1:100),df2=data.frame(row=1:100))
But that throws an error. It seems like what I'm doing shouldn't be too difficult, what am I doing wrong and how can I accomplish this?
Thanks.
Use do on a per-row basis instead of mutate:
mydf <- data.frame( row = 1:100 ) %>% group_by(row) %>%
do( df1 = data.frame(row=1:100), df2 = data.frame(row=1:100) ) %>% ungroup
# # A tibble: 100 x 3
# row df1 df2
# <int> <list> <list>
# 1 1 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 2 2 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 3 3 <data.frame [100 x 1]> <data.frame [100 x 1]>
# ...
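You can use replicate to achieve that. Pass simplify = FALSE so that it returns a list of data frames instead of trying to simplify the result, i.e.
mydf$df1 <- replicate(nrow(mydf), data.frame(row = 1:100), simplify = FALSE)
mydf$df2 <- replicate(nrow(mydf), data.frame(row = 1:100), simplify = FALSE) # nrow() keeps this generic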
I think you should use nested data frames as described in
https://www.rdocumentation.org/packages/tidyr/versions/0.6.1/topics/nest
But for what you ask, you need the operator I.
mydf <- data.frame(row=1:100)
for(i in 1:100){
mydf$df1[i] <- I(data.frame(row=1:100))
mydf$df2[i] <- I(data.frame(row=1:100))
}
show(mydf)
mydf$df1
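As a loop-free alternative in base R, one option is to build a list with one data frame per row and assign it as a column in a single step. This is a sketch, not the approach of any answer above.

```r
mydf <- data.frame(row = 1:100)

# One data frame per row of mydf; assigning a list whose length equals
# nrow(mydf) creates a list column
mydf$df1 <- lapply(seq_len(nrow(mydf)), function(i) data.frame(row = 1:100))
mydf$df2 <- lapply(seq_len(nrow(mydf)), function(i) data.frame(row = 1:100))

# Each cell is a data frame, retrieved with [[ ]]
class(mydf$df1[[1]])
names(mydf)
```

Use [[ ]] rather than [ ] to pull out a single cell: [ ] returns a one-element list, whereas [[ ]] returns the data frame itself.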
