Unable to access nested data elements inside mutate - r

I am trying to understand why the following code doesn't work. My understanding is that it takes data$Sepal.Length (an element within the nested data column) and iterates over that vector with the function sum.
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data$Sepal.Length, sum, na.rm = TRUE))
print(df)
But this throws the error Total.Sepal.Length must be size 3 or 1, not 0. The following code works by using an anonymous function, which is how nested data is usually accessed:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, function(x) sum(x$Sepal.Length, na.rm = TRUE)))
print(df)
I am trying to understand why the previous code didn't work even though I am correctly passing arguments to mutate and map.

You should do this:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, ~sum(.x$Sepal.Length, na.rm = TRUE)))
Two things: any reason you're not using group_by?
Second: your initial mutate is trying to perform:
map_dbl(df$data$Sepal.Length, sum, na.rm = TRUE)
This returns an empty result, because df$data$Sepal.Length is NULL (you have to access each list element to reach its columns, so df$data[[1]]$Sepal.Length works).
Output:
# A tibble: 3 × 3
Species data Total.Sepal.Length
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 250.
2 versicolor <tibble [50 × 4]> 297.
3 virginica <tibble [50 × 4]> 329.
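The NULL can be reproduced outside mutate. This is a minimal sketch that uses a hand-built plain list of data frames as a stand-in for the nested data column; the names list_col and empty are illustrative and not from the question.

```r
library(purrr)

# Stand-in for the nested `data` column: an unnamed list of three data frames
list_col <- unname(split(iris[, -5], iris$Species))

# `$` on a plain list looks for a list element called "Sepal.Length", not a column
list_col$Sepal.Length               # NULL

# The column only exists inside each element
length(list_col[[1]]$Sepal.Length)  # 50

# Mapping over NULL iterates zero times, hence the "size ... not 0" complaint in mutate()
empty <- map_dbl(list_col$Sepal.Length, sum, na.rm = TRUE)
length(empty)                       # 0
```

Mapping over the list itself and reaching into each element, as in the working versions, avoids the problem.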

Related

rsample vfold_cv function not accepting .y parameter from purrr::map2

I'm trying to create nested cross-validations using the rsample package, and I use purrr::map2 to create them multiple times, with a differing number of folds as dictated by the v parameter. However, the vfold_cv function does not accept the v parameter, and instead I get this error: Error: v must be a single integer.
In the reprex below, I'm simulating the situation using the mtcars data, by creating a cross validation for each cylinder. Replacing .y with a number works, but I need the parameter to vary with each cylinder by using the n column.
library(dplyr)
library(purrr)
library(parsnip)
library(rsample)
library(tidyr)
data("mtcars")
nested <- mtcars %>%
select(cyl, disp:gear) %>%
group_by(cyl) %>%
nest(data = disp:gear) %>%
cbind(n = 2:4)
nested %>%
group_by(cyl) %>%
mutate(cv = map2(data, n,
~nested_cv(.x,
inside = vfold_cv(v = 10, repeats = 3),
outside = vfold_cv(v = .y))))
Error: `v` must be a single integer.
The problem is the vfold_cv call inside nested_cv. You can reproduce it with an ordinary function:
createNested = function(x,y){
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = vfold_cv(v = y))
}
createNested(nested$data[[1]],3)
Error in vfold_splits(data = data, v = v, strata = strata, breaks = breaks) :
object 'y' not found
So it cannot see the y variable (like your .y) inside the function. So I wrote a function to explicitly pass the results of vfold_cv() for outside into nested_cv(), a few more lines of code but it's ok:
createNested = function(x,y){
outside_cv = vfold_cv(x,v = y)
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = outside_cv)
}
nested <- mtcars %>%
select(cyl, disp:gear) %>%
nest(data = disp:gear) %>%
mutate(n=2:4)
nested %>% mutate(cv = map2(data,n,.f=createNested))
# A tibble: 3 x 4
cyl data n cv
<dbl> <list> <int> <list>
1 6 <tibble [7 × 8]> 2 <tibble [2 × 3]>
2 4 <tibble [11 × 8]> 3 <tibble [3 × 3]>
3 8 <tibble [14 × 8]> 4 <tibble [4 × 3]>
Note, once you have nested the data, you don't need group_by()
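The answer does not show why the variable is invisible. One plausible mechanism, offered here only as an assumption about how such a function could behave, is that the call passed as an argument is re-evaluated somewhere the caller's local variables cannot be seen. The toy below is base R only; eval_elsewhere, make_value, broken and fixed are hypothetical names and this is not rsample code.

```r
# Hypothetical stand-in, NOT rsample code: re-evaluates a call it receives
# in its defining environment, where the caller's local variables are invisible.
eval_elsewhere <- function(arg) {
  expr <- substitute(arg)                  # capture the unevaluated argument
  if (is.call(expr)) {
    eval(expr, envir = parent.env(environment()))
  } else {
    arg                                    # an already-built object is used as-is
  }
}

make_value <- function(v) v * 10           # plays the role of vfold_cv(v = ...)

# The call make_value(n_folds_arg) is re-evaluated where n_folds_arg does not exist
broken <- function(n_folds_arg) eval_elsewhere(make_value(n_folds_arg))

# Build the object first, where n_folds_arg is visible, then pass the result
fixed <- function(n_folds_arg) {
  built <- make_value(n_folds_arg)
  eval_elsewhere(built)
}

msg <- tryCatch(broken(3), error = function(e) conditionMessage(e))
msg       # an "object ... not found" message naming n_folds_arg
fixed(3)  # 30
```

The fixed() pattern mirrors the createNested workaround: compute the outside resampling object before handing it over.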

R Dplyr: How to replace items in a vector matching items in a list, conditionally

As requested by a collaborator, I am trying to create a second version of a dataset with outliers removed. I have data with multiple groups (factors) and multiple numeric response variables. I want to write a function that (1) finds outliers and extremes by group using the 1.5*IQR and 3* IQR methods, (2) counts the outliers, and (3) if the number of outliers is greater than 2, replaces values for EXTREMES only with NA, but if the number of outliers is less than or equal to 2, replaces values for OUTLIERS with NA.
Because this is a grouped calculation, I have opted to use dplyr. I am hoping to apply this function to each of the several response variables in my dataset. I have achieved steps (1) and (2) with this method, but am facing the following issues with step (3).
Steps 1 and 2:
require(dplyr)
# Find outliers and extremes for one response variable by group. Mark if number of outliers per group is >2.
# List outliers and extremes.
out_ext_num <- iris %>%
group_by(Species) %>%
mutate(is_outlier = (Sepal.Length < summary(Sepal.Length)[2] - (1.5*IQR(Sepal.Length)) |
(Sepal.Length > ((1.5*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))),
is_extreme = (Sepal.Length < (summary(Sepal.Length)[2] - (3*IQR(Sepal.Length)))) |
(Sepal.Length > ((3*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))) %>%
summarise(out_num2 = sum(is_outlier) > 2, outliers = list(Sepal.Length[is_outlier == T]),
extremes = list(Sepal.Length[is_extreme == T]))
# A tibble: 3 x 4
Species out_num2 outliers extremes
<fct> <lgl> <list> <list>
1 setosa FALSE <dbl [0]> <dbl [0]>
2 versicolor FALSE <dbl [0]> <dbl [0]>
3 virginica FALSE <dbl [1]> <dbl [0]>
I can combine this with a nested version of my data so that the groups line up:
(EDITED)
nested <- iris %>%
select(Species, Sepal.Length) %>%
group_by(Species) %>%
nest() %>%
left_join(out_ext_num)
# A tibble: 3 x 5
# Groups: Species [3]
Species data out_num2 outliers extremes
<fct> <list> <lgl> <list> <list>
1 setosa <tibble [50 x 1]> FALSE <dbl [0]> <dbl [0]>
2 versicolor <tibble [50 x 1]> FALSE <dbl [0]> <dbl [0]>
3 virginica <tibble [50 x 1]> FALSE <dbl [1]> <dbl [0]>
Now, for each group, if out_num2 is FALSE, I want to match values from the data tibble (containing vector Sepal.Length) with values from the outliers list. If they match, I want to replace that value in the tibble with NA. If out_num2 is TRUE, I want to match values from the data tibble with values from the extremes list and replace corresponding values in the tibble with NA. Right now, I can't even match values from the list with the tibble, period, let alone do it conditionally. I have tried using %in% within mutate() and map(), but sense that I am incorrectly referencing the Sepal.Length vector within the tibble:
require(purrr)
nested %>%
mutate(Sepal.Length.o = map(data, ~ ifelse(Sepal.Length[.x %in% nested$outliers], NA, Sepal.Length)))
I've reviewed list referencing and several posts on matching values from different sites, but nothing quite works with this dplyr format. Is there another or better way to do this?
After figuring out how to match, I was planning to use nested ifelse statements to conditionally replace outlier or extreme values with NA, perhaps similar to this:
nested %>%
mutate(Sepal.Length.o = ifelse(out_num2 == T,
ifelse(match_tibble_with_extremes, NA, Sepal.Length),
ifelse(match_tibble_with_outliers, NA, Sepal.Length)))
But I'm not sure if that will work as I intend. Any help, especially with the matching step, would be greatly appreciated.
Here is one option
library(dplyr)
library(purrr)
nested %>%
rowwise %>%
mutate(data = map2(data, if(out_num2) list(extremes) else list(outliers), ~ replace(.x, .x %in% .y, NA)))
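As a variation, the replacement can be applied to the Sepal.Length vector inside each tibble rather than to the tibble as a whole. The sketch below is self-contained and rebuilds the nested table with a hypothetical helper, beyond_fence, that uses quantile() for the quartiles; it is one possible approach under those assumptions, not the answer's own method.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: values lying more than k * IQR beyond the quartiles
beyond_fence <- function(x, k) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  x[x < q[1] - k * iqr | x > q[2] + k * iqr]
}

nested <- iris %>%
  select(Species, Sepal.Length) %>%
  nest(data = Sepal.Length) %>%
  mutate(
    outliers = map(data, ~ beyond_fence(.x$Sepal.Length, 1.5)),
    extremes = map(data, ~ beyond_fence(.x$Sepal.Length, 3)),
    out_num2 = map_int(outliers, length) > 2
  )

# Row by row: pick extremes when there are more than 2 outliers, otherwise outliers,
# then blank the matching values of the Sepal.Length column
cleaned <- nested %>%
  mutate(data = pmap(
    list(data, out_num2, outliers, extremes),
    function(d, many, out, ext) {
      to_drop <- if (many) ext else out
      d$Sepal.Length <- replace(d$Sepal.Length, d$Sepal.Length %in% to_drop, NA)
      d
    }
  ))

# Count the NA values introduced per species
map_int(cleaned$data, ~ sum(is.na(.x$Sepal.Length)))
```

Because %in% matches by value, every observation equal to an outlier value is replaced, which is usually what is wanted here.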

Join tibbles in list to one tibble [duplicate]

I have a list of two data frames
a = list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17)
)
and add a new column to the data sets by
b = a %>%
map(~ mutate(.x, class = floor(runif(nrow(.x), 0, 2)))) %>%
map(~ nest(.x, -class))
Now I want to join the two list elements to one tibble based on class. Specifically, I am looking for a "smoother" solution than inner_join(pluck(b, 1), pluck(b, 2), "class") which gives the desired results but quickly gets messy if more data sets are involved in the list a.
This question is not super clear, but it seemed like there might be enough use cases to go for it. I added a few more data frames to a, constructed similarly, because the sample you used is too small to really see what you need to deal with.
library(tidyverse)
set.seed(123)
a <- list(
mtcars %>% as_tibble() %>% select(-vs),
mtcars %>% as_tibble() %>% sample_n(17),
mtcars %>% as_tibble() %>% slice(1:10),
mtcars %>% as_tibble() %>% select(mpg, cyl, disp)
)
# same construction of b as in the question
You can use purrr::reduce to carry out the inner_join call repeatedly, returning a single data frame of nested data frames. That's straightforward enough, but I couldn't figure out a good way to supply the suffix argument to the join, which assigns .x and .y by default to differentiate between duplicate column names. So you get these weird names:
b %>%
reduce(inner_join, by = "class")
#> # A tibble: 2 x 5
#> class data.x data.y data.x.x data.y.y
#> <dbl> <list> <list> <list> <list>
#> 1 1 <tibble [11 × 10… <tibble [8 × 11… <tibble [3 × 11… <tibble [17 × …
#> 2 0 <tibble [21 × 10… <tibble [9 × 11… <tibble [7 × 11… <tibble [15 × …
You could probably deal with the names by creating something like data1, data2, etc before the reduce, but the quickest thing I decided on was replacing the suffixes with just the index of each data frame from the list b. A more complicated naming scheme would be a task for a different question.
b %>%
reduce(inner_join, by = "class") %>%
rename_at(vars(starts_with("data")),
str_replace, "(\\.\\w)+$", as.character(1:length(b))) %>%
names()
#> [1] "class" "data1" "data2" "data3" "data4"

How to write to disk data from each element of nested data frame in R?

My question is directly related to this one: In R, write each nested data frame to a CSV, but I am not able to get the solution to work and would like to avoid needing to install the extra required package purrrlyr.
I need to write each element (data.frame) of a nested data.frame to a table, with the name of each element corresponding to the first column of the nested data.frame:
ir <- iris %>% group_by(Species) %>% nest()
ir$Species <- as.character(ir$Species)
# A tibble: 3 x 2
Species data
<chr> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
I tried the linked solution:
temp <- ir %>% purrrlyr::by_row(~write.csv(.$data, file = .$Species))
But receive the following error:
Error in by_row(., ~write.csv(.$data, file = .$Species)) :
STRING_PTR() can only be applied to a 'character', not a 'list'
I have read about purrr::walk but I can't seem to figure out how to implement it.
We could use map2
library(purrr)
map2(ir$data, ir$Species, ~ write.csv(.x, file = paste0(.y, ".csv")))
If we don't want the list of NULLs that map2 returns printed to the console, use iwalk
iwalk(setNames(ir$data, ir$Species), ~ write.csv(.x, file = paste0(.y, ".csv")))
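purrr::walk2 is another option that takes the two columns directly and returns its input invisibly. A minimal sketch, writing into tempdir() so it leaves the working directory untouched; out_dir and paths are illustrative names.

```r
library(dplyr)
library(tidyr)
library(purrr)

ir <- iris %>%
  mutate(Species = as.character(Species)) %>%
  nest(data = -Species)

out_dir <- tempdir()
paths <- file.path(out_dir, paste0(ir$Species, ".csv"))

# One CSV per nested data frame, named after the Species value
walk2(ir$data, paths, ~ write.csv(.x, file = .y, row.names = FALSE))

file.exists(paths)
```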

purrr::map_df with nested data.frame

I'd like to iterate over a series of dataframes and apply the same function to them all.
I'm trying this using tidyr::nest and purrr::map_df. Here's a reprex of the sort of thing I'm trying to achieve.
data(iris)
library(purrr)
library(tidyr)
iris_df <- as.data.frame(iris)
my_var <- 2
my_fun <- function(df) {
sum_df <- sum(df) + my_var
}
iris_df %>% group_by(Species) %>% nest() %>% map_df(.$data, my_fun)
# Error: Index 1 must have length 1
What am I doing wrong? Is there a different approach?
EDIT:
To clarify my desired output. Aiming for new column containing output eg
|Species|Data|my_function_output|
|:------|:---|:-----------------|
|setosa |<tibble>|509.1 |
The problem is that nest() gives you a data.frame with a column data which is a list of data.frames. You need to map or sapply over the data column of the nest() output, not the entire nest output. I use sapply, but you could also use map_dbl. If you use map you will end up with list output, and map_df will not work because it requires named input.
iris_df %>%
group_by(Species) %>%
nest() %>%
mutate(my_fun_out = sapply(data, my_fun))
# A tibble: 3 x 3
Species data my_fun_out
<fct> <list> <dbl>
1 setosa <tibble [50 x 4]> 509
2 versicolor <tibble [50 x 4]> 717
3 virginica <tibble [50 x 4]> 859
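The map_dbl variant mentioned above can be sketched as follows. It is self-contained, with my_fun rewritten to return its value explicitly; res is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

my_var <- 2
my_fun <- function(df) {
  sum(df) + my_var   # sum over all numeric columns of one nested data frame
}

res <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(my_fun_out = map_dbl(data, my_fun))  # one double per nested data frame

res$my_fun_out
```

map_dbl fails loudly if my_fun ever returns something other than a single number, which sapply would silently accept.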
