Different function outcome based on nested structure - r

I'm trying to apply a function that looks at the structure of a nested tibble and then, depending on which group a row belongs to, applies a different outcome. For instance:
library(tidyverse)
library(lubridate)

df <- data.frame(
  species = c(
    "setosa",
    "virginica",
    "versicolor"
  ),
  detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
) %>%
  group_by(species) %>%
  nest()
# idea of the function: if the group == "virginica", a specific rule is applied whereby any date
# in the data that falls within the interval gets "summer_2018" assigned to a newly created season column;
# for all other groups further rules will be applied (in my case a different set of dates, simplified here)
# in the actual function there are a number of intervals per group and a number of outcomes
fun1 <- function(nested_list) {
  if (nested_list$species == "virginica") {
    nested_list %>% mutate(data = map(data, ~ mutate(.x, season = case_when(
      detection_timestamp_utc %within% interval(
        ymd_hms("2018-02-01 00:00:00"),
        ymd_hms("2018-09-30 23:59:59")
      ) ~ "summer_2018",
      TRUE ~ "winter_2018"
    ))))
  } else {
    nested_list %>% mutate(data = map(data, ~ mutate(.x, season = "not_applicable_wrong_group")))
  }
}
# from the above I would want the nested tibble row where species == "virginica" to be assigned "summer_2018",
# while all the rest should be assigned "not_applicable_wrong_group", as the test that looks at the group should return FALSE.
# however, it fails to work:
df2 = fun1(df)
#> Warning in if (nested_list$species == "virginica") {: the condition has length >
#> 1 and only the first element will be used
I'm not sure this is the best structure, but in theory I'll have a number of groups and slightly different rules will need to be applied to each group.

I think it's probably a mistake to nest the table and then transform it in different ways; I would think it would be easier to simply transform it unnested. Putting that aside for a moment ...
I think you can make the following changes to your starting df and to your function:
For your df, add a final step of dplyr::group_split(). This will then allow you to purrr::map over each group, and as each group will have a single species value, your existing code should work as you intend.
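To see why the original if() misfires: after group_by(species) %>% nest() the outer tibble has one row per species, so nested_list$species == "virginica" is a length-3 logical vector and if() only looks at its first element, which is what the warning is telling you. A quick check on the nested df from the question:
df$species == "virginica"
#> [1] FALSE  TRUE FALSE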
OK, here is the group_split() approach:
library(tidyverse)
library(lubridate)

df <- data.frame(
  species = c(
    "setosa",
    "virginica",
    "versicolor"
  ),
  detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
) %>%
  group_by(species) %>%
  nest() %>%
  group_split()
fun1 <- function(nested_list) {
  map(nested_list, ~ {
    if (.x$species == "virginica") {
      .x %>% mutate(data = map(data, ~ mutate(.x, season = case_when(
        detection_timestamp_utc %within% interval(
          ymd_hms("2018-02-01 00:00:00"),
          ymd_hms("2018-09-30 23:59:59")
        ) ~ "summer_2018",
        TRUE ~ "winter_2018"
      ))))
    } else {
      .x %>% mutate(data = map(data, ~ mutate(.x, season = "not_applicable_wrong_group")))
    }
  }) %>% bind_rows()
}
fun1(df)
But again, my advice would be to not overcomplicate things by nesting; it doesn't seem particularly natural here. Already you have mutate() calls with map() calls inside them, plus other mutate() calls... overcomplicated for the sort of scenario you are describing.
Therefore consider the relative simplicity of a non-nested approach; instead of adding group_split() to your df, leave your df unnested. Then you might do:
library(tidyverse)
library(lubridate)

df <- data.frame(
  species = c(
    "setosa",
    "virginica",
    "versicolor"
  ),
  detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
)
fun1 <- function(normal_df) {
  normal_df %>% mutate(season = case_when(
    species == "virginica" ~
      case_when(
        detection_timestamp_utc %within% interval(
          ymd_hms("2018-02-01 00:00:00"),
          ymd_hms("2018-09-30 23:59:59")
        ) ~ "summer_2018",
        TRUE ~ "winter_2018"
      ),
    TRUE ~ "not_applicable_wrong_group"
  ))
}
fun1(df)
# or, if you must:
fun1(df) %>% group_by(species) %>% nest()

Related

Set up output-list names in slider::slide_index()

slider::slide_index() uses a purrr::map()-like syntax to apply a function over a rolling (here, time-based) window. The output of this function is a list. My question is: how do I set the names of this list?
The slide_index() function does not have a .names_to argument like slide_index_dfr() does.
See reproducible example below:
library(slider)
library(lubridate)
library(dplyr)

storms_summary <- storms %>%
  # filter dataset so it does not eat all memory and save computing time ;)
  filter(year <= 2000 & year >= 1999) %>%
  # make a date column to mimic my real data
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date) %>%
  slider::slide_index(
    .x = .,
    .i = .$storm_date,
    .f = ~ summary(.x),
    .after = lubridate::weeks(2),
    .complete = TRUE
  )
The expected output is the list storms_summary with names that are character strings indicating the first and the last date of the period summarized. Instead, with the code above, we get:
> names(storms_summary)
NULL
If there is no way to assign the names within the function, we can create the names ourselves and use setNames() to assign them.
library(slider)
library(lubridate)
library(dplyr)

storms_summary <- storms %>%
  # filter dataset so it does not eat all memory and save computing time ;)
  filter(year <= 2000 & year >= 1999) %>%
  # make a date column to mimic my real data
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date) %>%
  mutate(
    sliding_output = slider::slide_index(
      .x = cur_data(),
      .i = storm_date,
      .f = ~ summary(.x),
      .after = lubridate::weeks(2),
      .complete = TRUE
    ),
    names = slider::slide_index_chr(
      .x = storm_date,
      .i = storm_date,
      .f = ~ paste0(range(.x), collapse = '-'),
      .after = lubridate::weeks(2)
    ),
    sliding_output = setNames(sliding_output, names)
  ) %>%
  select(-names)
names(storms_summary$sliding_output)
# [1] "1999-07-02-1999-07-03" "1999-07-03-1999-07-03" "1999-07-03-1999-07-03"
# [4] "1999-07-03-1999-07-03" "1999-08-24-1999-09-07" "1999-08-24-1999-09-07"
# [7] "1999-08-24-1999-09-07" "1999-08-25-1999-09-08" "1999-08-25-1999-09-08"
#[10] "1999-08-25-1999-09-08" "1999-08-25-1999-09-08" "1999-08-26-1999-09-09"
#...
#...
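For completeness, the same idea also works outside mutate(): build the list of summaries and the window labels with two separate slide_index() calls, then attach the labels with setNames(). A minimal sketch, where storms_prepped is just a hypothetical name for the filtered and arranged data built above (slider, lubridate and dplyr loaded as before):
storms_prepped <- storms %>%
  filter(year <= 2000 & year >= 1999) %>%
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date)

# rolling two-week summaries, one list element per row
summaries <- slide_index(
  .x = storms_prepped,
  .i = storms_prepped$storm_date,
  .f = ~ summary(.x),
  .after = weeks(2),
  .complete = TRUE
)

# matching window labels: the first and last date covered by each window
window_labels <- slide_index_chr(
  .x = storms_prepped$storm_date,
  .i = storms_prepped$storm_date,
  .f = ~ paste0(range(.x), collapse = "-"),
  .after = weeks(2)
)

storms_summary <- setNames(summaries, window_labels)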

How to keep a chain of processing going, referring to unnamed outputs from an intermediate process, in R?

I'm trying to connect multiple processes with the native pipe |> in R.
In the following MWE,
I give the data frame iris to the first and the second process simultaneously in (\(x){...})();
inside (\(x){...})(), the output of the first process is named setosa.Sepal.Length.np and that of the second process is named versicolor.Sepal.Length.np via the double-arrow assignment operator ->>;
the correlation between these two outputs is then calculated with cor(), and this calculation is not directly connected to the preceding chain.
iris |>
  (\(x){
    ## First process
    filter(
      x,
      Species == "setosa"
    ) |>
      dplyr::select(Sepal.Length) ->>
      setosa.Sepal.Length.np
    ## Second process
    filter(
      x,
      Species == "versicolor"
    ) |>
      dplyr::select(Sepal.Length) ->>
      versicolor.Sepal.Length.np
  })()

cor(setosa.Sepal.Length.np, versicolor.Sepal.Length.np)
I want to connect the correlation calculation directly to the previous processes with |>. To do so, I have to leave the outputs of the first and second process unnamed, and I must not create the objects setosa.Sepal.Length.np and versicolor.Sepal.Length.np. But how should I then refer to the outputs of the first and second process?
iris |>
  (\(x){
    filter(
      x,
      Species == "setosa"
    ) %>%
      dplyr::select(Sepal.Length) # leave the first output unnamed
    filter(
      x,
      Species == "versicolor"
    ) %>%
      dplyr::select(Sepal.Length) # leave the second output unnamed
  })() |> # send the outputs of this process to the next cor()
  cor(????, ????) # How should I refer to the first and second output here?
Supplementary code
The first MWE is equivalent to the following code using magrittr's %>%.
require(magrittr)

iris %>%
  {
    filter(
      .,
      Species == "setosa"
    ) %>%
      dplyr::select(Sepal.Length) ->>
      setosa.Sepal.Length
    filter(
      .,
      Species == "versicolor"
    ) %>%
      dplyr::select(Sepal.Length) ->>
      versicolor.Sepal.Length
  }

cor(setosa.Sepal.Length, versicolor.Sepal.Length)
With the native pipe you can only pipe values into the first parameter; you can never pipe values into the second parameter of cor(). There's just no way to do that. You'd need an intermediate function that takes the list of values returned by the previous step. For example, here we create a helper function corlist() to accept that list.
corlist <- function(x) cor(x[[1]], x[[2]])

iris |>
  (\(x){
    list(
      filter(
        x,
        Species == "setosa"
      ) %>%
        dplyr::select(Sepal.Length),
      filter(
        x,
        Species == "versicolor"
      ) %>%
        dplyr::select(Sepal.Length)
    )
  })() |>
  corlist()
# Sepal.Length
# Sepal.Length -0.08084973
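A variant of the same idea that avoids a named helper (assuming R >= 4.1 for the \(x) lambda syntax): a second inline anonymous function receives the two-element list and calls cor() on its parts.
iris |>
  (\(x) list(
    dplyr::filter(x, Species == "setosa")     |> dplyr::select(Sepal.Length),
    dplyr::filter(x, Species == "versicolor") |> dplyr::select(Sepal.Length)
  ))() |>
  (\(l) cor(l[[1]], l[[2]]))()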
If you must refer to the variable names specifically, one possible way is to use with():
library(dplyr)

iris %>%
  {
    list(
      subset(., Species == 'setosa', select = 'Sepal.Length') %>%
        rename(., 'setosa.Sepal.Length' = 'Sepal.Length'),
      subset(., Species == 'versicolor', select = 'Sepal.Length') %>%
        rename(., 'versicolor.Sepal.Length' = 'Sepal.Length')
    )
  } %>%
  do.call(cbind, .) %>%
  with(., cor(setosa.Sepal.Length, versicolor.Sepal.Length))
If only cor is needed, you can code-golf this a bit and avoid dplyr entirely:
iris %>%
  {
    list(
      subset(., Species == 'setosa', select = 'Sepal.Length'),
      subset(., Species == 'versicolor', select = 'Sepal.Length')
    )
  } %>%
  do.call(cbind, .) %>%
  cor # or {cor(.[, 1], .[, 2])}, although less elegant

Change iteration number in one Dplyr command

I need to be able to change the iteration number in each separate line of one dplyr pipeline.
I have prepared an example of the 'by hand' approach and of what I need to do in 'pseudo steps'.
library(tidyverse)

cr <-
  mtcars %>%
  group_by(gear) %>%
  nest()

# This is the 'by-hand' approach of what I would like to do - how to automate it?
# E.g. we do not know all values of 'carb' in advance.
cr$data[[1]] %>%
  mutate(VARIABLE1 =
           case_when(carb == 1 ~ hp/mpg,
                     TRUE ~ 0)) %>%
  mutate(VARIABLE2 =
           case_when(carb == 2 ~ hp/mpg,
                     TRUE ~ 0)) %>%
  mutate(VARIABLE4 =
           case_when(carb == 4 ~ hp/mpg,
                     TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort() %>% unique()

for (i in vals) {
  message(i)
  cr$data[[1]] %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the first element of vals
                                                      TRUE ~ 0)) %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the second element of vals
                                                      TRUE ~ 0)) %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the third element of vals
                                                      TRUE ~ 0))
}
Is there any trick, maybe using the purrr package, or another solution?
I need to iterate over the unique values of some variable and, for each unique value, create a new column in the data frame. I need to automate this, but I am not able to do so on my own.
You can do this using sym to convert text to symbols and !! to evaluate them within dplyr functions. See this question and this vignette for further details.
For your application, you probably want something like this:
carbs = c(1, 2, 4)

for (cc in carbs) {
  var_name = sym(paste0("VARIABLE", cc))
  cr$data[[1]] = cr$data[[1]] %>%
    mutate(!!var_name := case_when(carb == cc ~ hp/mpg,
                                   TRUE ~ 0))
}
There are three key parts to this:
sym() turns the text string into a symbol variable.
!! means that the symbol is treated as an R expression.
:= lets us use !! evaluation on the left-hand side of the assignment.
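If the carb values are not known in advance (which was the point of the question), the same idiom works with the levels pulled from the data instead of the hard-coded carbs vector; a minimal sketch:
vals <- sort(unique(cr$data[[1]]$carb))

for (cc in vals) {
  var_name <- sym(paste0("VARIABLE", cc))
  cr$data[[1]] <- cr$data[[1]] %>%
    mutate(!!var_name := case_when(carb == cc ~ hp / mpg,
                                   TRUE ~ 0))
}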

Automate Dplyr's mutate function

What is the best way to automate the mutate() step in one dplyr aggregation?
It's best if I demonstrate with an example.
In the first part of the example I create new columns based on the values of the variable carb. However, imagine I need to automate this step to 'iterate' over all unique values of carb and create a new column for each value.
Is there any way to do so?
library(tidyverse)

cr <-
  mtcars %>%
  group_by(gear) %>%
  nest()

# This is the 'by-hand' approach of what I would like to do - how to automate it?
# E.g. we do not know all values of 'carb' in advance.
cr$data[[1]] %>%
  mutate(VARIABLE1 =
           case_when(carb == 1 ~ hp/mpg,
                     TRUE ~ 0)) %>%
  mutate(VARIABLE2 =
           case_when(carb == 2 ~ hp/mpg,
                     TRUE ~ 0)) %>%
  mutate(VARIABLE4 =
           case_when(carb == 4 ~ hp/mpg,
                     TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort() %>% unique()

for (i in vals) {
  message(i)
  cr$data[[1]] %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the first element of vals
                                                      TRUE ~ 0)) %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the second element of vals
                                                      TRUE ~ 0)) %>%
    mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # at this line, i should be the third element of vals
                                                      TRUE ~ 0))
}
One way would be to use dummy_cols() from the package fastDummies.
Doing it for one data frame at a time:
library(fastDummies)

cr$data[[1]] %>%
  dummy_cols(select_columns = 'carb') %>%
  mutate_at(vars(starts_with('carb_')), funs(. * hp / mpg))
You can also do this first and then group by gear, since the gear value is not used in the calculation, so it wouldn't matter. For that:
cr_new = mtcars %>%
  dummy_cols(select_columns = 'carb') %>%
  mutate_at(vars(starts_with('carb_')), funs(. * hp / mpg)) %>%
  group_by(gear) %>%
  nest()
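As a side note, funs() is deprecated and mutate_at() is superseded in current dplyr releases; the same step can be written with mutate(across()) and a lambda, for example:
cr_new = mtcars %>%
  dummy_cols(select_columns = 'carb') %>%
  mutate(across(starts_with('carb_'), ~ .x * hp / mpg)) %>%
  group_by(gear) %>%
  nest()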
Perhaps something like this would help:
library(dplyr)
library(purrr)

# one transmute() per unique carb value; compare carb against .x (the current value)
bind_cols(mtcars, map_dfc(unique(mtcars$carb),
                          ~ mtcars %>%
                            transmute(!!paste0('carb', .x) := case_when(carb == .x ~ hp/mpg,
                                                                        TRUE ~ 0))))
It sounds a lot like what's called "the XY-problem".
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Please read about tidy data, and/or tidyr's pivot_longer/pivot_wider. Column names should not encode information.
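For what it's worth, here is a minimal sketch of the tidyr route that comment points to, assuming the goal really is one hp/mpg column per carb value (a wide layout): compute the ratio once, then let pivot_wider() spread the carb levels into columns.
library(tidyverse)

mtcars %>%
  mutate(ratio = hp / mpg,
         carb_lvl = paste0("VARIABLE", carb)) %>%
  pivot_wider(names_from = carb_lvl,
              values_from = ratio,
              values_fill = 0)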

dplyr self-join with filter

I want to subtract the values of the row labelled "baseline" from all the other rows for each user in a long-format data frame. It is easy to do this in two steps using a left_join() with the "baseline" subset. However, I could not figure out how to combine vas_1 and vas_diff into one chain.
library(dplyr)

# Create test data
n_users = 5
vas = data_frame(
  user = rep(letters[1:n_users], each = 3),
  group = rep(c("baseline", "early", "late"), n_users),
  vas = round(rgamma(n_users * 3, 10, 1.4))
)
# The above data are given
# Assume some other operations are required
vas_1 = vas %>%
  mutate(
    vas = vas * 2
  )
# I want to put the following into one
# chain with the above
# Use self-join to subtract baseline
vas_diff = vas_1 %>%
  filter(group != "baseline") %>%
  # Problem is vas_1 here. Using . gives error here
  # Adding copy = TRUE does not help
  # left_join(. %>% filter(group == "baseline"), by = c("user")) %>%
  left_join(vas_1 %>% filter(group == "baseline"), by = c("user")) %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group.x, vas) # remove temporary variables

vas_diff
I use an anonymous function when . needs to be used multiple times:
... %>% (function(df) { ... }) %>% ...
Hence, in your case:
vas_diff = vas_1 %>%
  filter(group != "baseline") %>%
  (function(df) left_join(df, df %>% filter(group == "baseline"), by = c("user"))) %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group.x, vas)
(This is not going to produce the desired result, as described in the comments above, but it shows how to use an anonymous function.)
What you probably want is this:
vas_diff = vas_1 %>%
  left_join(
    x = filter(., group != "baseline"),
    y = filter(., group == "baseline"),
    by = c("user")
  ) %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group.x, vas) # remove temporary variables
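A closely related idiom wraps the join in braces; inside { } magrittr does not inject the left-hand side as an extra first argument, and the dot can be reused freely. A minimal sketch of the same self-join:
vas_diff = vas_1 %>%
  {
    left_join(
      x = filter(., group != "baseline"),
      y = filter(., group == "baseline"),
      by = c("user")
    )
  } %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group.x, vas) # remove temporary variables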
Here's a similar option, and a demonstration that you can pass whole pipe chains as arguments to the join. Instead of moving the . inside filter(), you can pass . as an argument to eval(), then remove unwanted columns on the right-hand side. This is largely just to document this approach for my own purposes.
vas_diff = vas_1 %>%
  left_join(x = eval(.) %>%
              filter(group != "baseline"),
            y = eval(.) %>%
              filter(group == "baseline") %>%
              select(-group),
            by = c("user")) %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group, vas)
Does anybody know why you can't simply pass the ., like x = . %>% filter(...)? Why do we need eval(.)?
