Writing dplyr function to pass to group_map - r

The problem is to apply a function f to each group of a tibble. There is a simpler way to do this, but I would like to solve the problem using the group_map() function.
Data used: starwars of the dplyr package.
What I want is to get an average of the height variable for a grouped tibble considering the variables gender and species. I know the problem could be easily solved by doing:
starwars %>% group_by(gender, species) %>%
summarise(mean = mean(height, na.rm = TRUE))
However, I want to wrap summarise(mean = mean(height, na.rm = TRUE)) in a function and pass that function to group_map().
I tried to create a function f() whose first argument, dados, is a tibble with the groups already defined. Its second argument is ..., so that I can pass the variables of interest from dados to f().
f <- function(dados, ...){
dados %>% summarise(mean = mean(..., na.rm = TRUE))
}
starwars %>% group_by(gender, species) %>%
group_map(.tbl = ., .f = ~f(dados = .x), height)

Solutions:
func_1 <- function(dados, var, ...){
var_interesse <- enquo(var)
dots <- enquos(...)
# The dots could also be forwarded directly as ...
dados %>% group_by(!!!dots) %>%
summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
starwars %>% func_1(var = height, gender, species)
or
func_2 <- function(dados, var){
var_interesse <- enquo(var)
#dots <- enquos(...)
dados %>% summarise(media = mean(x = !!var_interesse, na.rm = TRUE))
}
agrupamento <- starwars %>% group_by(gender, species)
agrupamento %>%
group_map(.tbl = ., .f = ~func_2(dados = .x, var = height))
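In recent dplyr releases the first argument of group_map() is named .data rather than .tbl, so the call above may fail depending on the installed version. The following is a minimal sketch, not the original answer's method, that assumes dplyr >= 0.8.1 and rlang >= 0.4.0: it passes the grouped tibble positionally, uses the {{ }} operator instead of enquo() and !!, and uses group_modify() so the result comes back as one tibble rather than a list. The name func_3 is hypothetical.

```r
library(dplyr)

# Hypothetical variant of func_2: {{ var }} forwards the bare column name
func_3 <- function(dados, var) {
  dados %>% summarise(media = mean({{ var }}, na.rm = TRUE))
}

# group_modify() calls func_3 once per group and row-binds the results,
# adding the grouping columns back to each piece
res <- starwars %>%
  group_by(gender, species) %>%
  group_modify(~ func_3(.x, height))

res
```

If a list with one tibble per group is preferred, group_map() can be swapped in for group_modify() with the same lambda.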

Related

Deploying arrange(desc(.)) on each variable passed previously via enquos

Background
Using rlang, I have a simple summary function for dplyr that counts the number of missing observations in a variable within the provided groups. I would like to return the results in descending order of the grouping variables.
Sample data
library("tidyverse")
set.seed(123)
test_data <- tibble(dates = seq.Date(
from = as.Date.character(x = "01-01-2000", format = "%d-%m-%Y"),
to = as.Date.character(x = "31-12-2010", format = "%d-%m-%Y"),
by = "day"
)) %>%
transmute(
t_year = lubridate::year(dates),
t_mnth = lubridate::month(dates),
t_day = lubridate::day(dates),
tst_var = if_else(rnorm(n()) > .8, NA_real_, rnorm(n()))
)
Summary function
Working version
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var)))
}
Desired results
test_data %>%
group_by(t_year, t_mnth) %>%
summarise(num_missing = sum(is.na(tst_var))) %>%
arrange(desc(t_year), desc(t_mnth))
Problem
I want to implement the arrange(desc(x)) call so that it handles each of the variables initially passed via enquos. That is, if 5 grouping variables are passed via enquos, the result should be equivalent to arrange(desc(var1), ..., desc(var5)).
Attempt
Naturally, this doesn't work:
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var))) %>%
# Desc call should be created for each of the group variables
arrange(desc(!!!group_by_vars))
}
You can use arrange_at like this:
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var))) %>%
arrange_at(group_by_vars, desc)
}
quick_smry(test_data, tst_var, t_year, t_mnth)
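Because the scoped verbs such as arrange_at() are superseded in dplyr 1.0.0 and later, an alternative is to build one desc() call per captured variable and splice the calls into arrange(). This is a sketch of that idea, not the answer's method; the names quick_smry_desc and toy are made up for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of quick_smry() that sorts by every grouping variable
quick_smry_desc <- function(df, x, ...) {
  group_by_vars <- enquos(...)
  check_var <- enquo(x)
  # Wrap each captured grouping variable in its own desc() call
  desc_calls <- lapply(group_by_vars, function(v) expr(desc(!!v)))
  df %>%
    group_by(!!!group_by_vars) %>%
    summarise(num_missing = sum(is.na(!!check_var))) %>%
    ungroup() %>%
    arrange(!!!desc_calls)
}

# Small made-up data set so the ordering is easy to check by eye
toy <- tibble(
  yr  = c(2000, 2000, 2001, 2001, 2001),
  mn  = c(1, 2, 1, 1, 2),
  val = c(NA, 1, NA, NA, 3)
)
res <- quick_smry_desc(toy, val, yr, mn)
res
```

The rows come back sorted by yr descending and then mn descending, matching arrange(desc(yr), desc(mn)).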

From map_dfr to SparkR's apply function

In the following code, I want to replace map_dfr from purrr with one of the SparkR apply functions to parallelize the Shapley calculations on Azure Databricks:
#install.packages("randomForest"); install.packages("tidyverse"); install.packages("iml"); install.packages("SparkR")
library(tidyverse); library(iml); library(randomForest); library(SparkR)
mtcars1 <- mtcars %>% mutate(vs = as.factor(vs), id = row_number())
x <- "vs"
y <- paste0(setdiff(setdiff(names(mtcars1), "vs"), "id"), collapse = "+")
rf = randomForest(as.formula(paste0(x, "~ ", y)), data = mtcars1, ntree = 50)
predictor <- Predictor$new(rf, data = mtcars1, y = mtcars1$vs)
shapelyresults <- map_dfr(1:nrow(mtcars), ~(Shapley$new(predictor, x.interest = mtcars1[.x,]) %>%
.$results %>%
as_tibble() %>%
arrange(desc(phi)) %>%
slice(1:5) %>%
select(feature.value, phi) %>%
mutate(id = .x)))
I could not adapt the answer to the following question to my case: How to apply a function to each row in SparkR?

Trying to understand how eval(expr, envir = df) works

I have built a function which seems to work, but I don't understand why.
My initial problem was to take a data.frame which contains counts of a population and expand it to re-create the original population. This is easy enough if you know the column names in advance.
library(tidyverse)
set.seed(121)
test_counts <- tibble(Population = letters[1:4], Length = c(1,1,2,1),
Number = sample(1:100, 4))
expand_counts_v0 <- function(Length, Population, Number) {
tibble(Population = Population,
Length = rep(Length, times = Number))
}
test_counts %>% pmap_dfr(expand_counts_v0) %>% # apply it
group_by(Population, Length) %>% # test it
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
However, I wanted to generalise it to a function which didn't need to know the column names of the data.frame, and I'm interested in NSE, so I wrote:
test_counts1 <- tibble(Population = letters[1:4],
Length = c(1,1,2,1),
Number = sample(1:100, 4),
Height = c(100, 50, 45, 90),
Width = c(700, 50, 60, 90)
)
expand_counts_v1 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% pmap_dfr(make_tbl)
}
But, when I test this function it seems to duplicate rows 4 times:
test_counts %>% expand_counts_v1(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ sum(.$Number)/sum(test_counts$Number)}
# [1] 4
This led me to guess a solution, which was
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% make_tbl
}
This seems to work:
test_counts %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
test_counts1 %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length, Height, Width) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts1)}
# [1] TRUE
But I don't understand why. How is it evaluating for each row, even though I'm not using pmap anymore? The function needs to be applied to each row in order to work, so it must be somehow, but I can't see how it's doing that.
EDIT
After Artem's correct explanation of what was going on, I realised I could do this
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
expr(tibble(!!!cols)) %>% eval_tidy(data = df)
}
This gets rid of the unnecessary make_tbl function. However, as Artem said, it only really works because rep is vectorised. So it works, but not by rewriting the _v0 function and pmapping it, which is the process I was trying to replicate. Eventually, I discovered rlang::new_function and wrote:
expand_counts_v3 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
all_names <- df %>% names %>% map(as.name)
args <- rep(0, times = length(all_names)) %>% as.list %>% set_names(all_names)
correct_function <- new_function(args, # this makes the function as in _v0
expr(tibble(!!!cols)) )
pmap_dfr(df, correct_function) # applies it as in _v0
}
which is longer, and probably uglier, but works the way I originally wanted.
The issue is in eval( envir = df ), which exposes the entire data frame to make_tbl(). Notice that you never use the ... argument inside make_tbl(). Instead, the function effectively computes the equivalent of
with( df, tibble(Population = rep(Population, times = Number),
Length = rep(Length, times=Number)) )
regardless of what arguments you provide to it. When you call the function via pmap_dfr(), it essentially computes the above four times (once for each row) and concatenates the results by-row, resulting in the duplication of entries you've observed. When you remove pmap_dfr(), the function is called once, but since rep is itself vectorized (try doing rep( test_counts$Population, test_counts$Number ) to see what I mean), make_tbl() computes the entire result in one go.
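The vectorised behaviour of rep() described above can be checked on a two-row data frame. This is a minimal base R sketch; the data frame toy_counts is made up for illustration and is not part of the question.

```r
# Made-up two-row counts table
toy_counts <- data.frame(
  Population = c("a", "b"),
  Length = c(1, 2),
  Number = c(2, 3),
  stringsAsFactors = FALSE
)

# eval() with envir = toy_counts sees whole columns, not single rows,
# and rep() with a vector `times` repeats each element by its own count
expanded <- eval(
  quote(data.frame(
    Population = rep(Population, times = Number),
    Length = rep(Length, times = Number),
    stringsAsFactors = FALSE
  )),
  envir = toy_counts
)
expanded
```

A single call already yields all 2 + 3 = 5 rows, which is why wrapping the same expression in pmap_dfr() repeats the full result once per input row.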

Use variable names in function in dplyr for sum and cumsum

dplyr programming question here. I am trying to write a dplyr function which takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following data frame, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function has no data from the referenced logical column y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output in my test2 column is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
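One point that can be stated with some confidence about the second attempt: data[[cumulative_y]] indexes the full, ungrouped data frame, so every group's sum() is computed over all rows rather than over the rows of that group. A hedged sketch of one way around this, assuming rlang >= 0.4.0 so that the {{ }} operator is available, is shown below; cumsum_df2 and toy_df are hypothetical names, and the small deterministic data frame replaces the random one from the question.

```r
library(dplyr)

# Hypothetical rewrite: {{ }} forwards bare column names into dplyr verbs,
# so the column is looked up inside each group
cumsum_df2 <- function(data, date_field, cumulative_y, minimum_date) {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    ungroup() %>%
    mutate(cumulative_a = cumsum(total),
           cumulative_b = cumsum(total_2)) %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

# Deterministic toy data: one row per day
toy_df <- data.frame(
  x = 1:4,
  y = c(TRUE, FALSE, TRUE, TRUE),
  date = as.Date("2018-01-01") + 0:3
)
res <- cumsum_df2(toy_df, date, y, minimum_date = "2018-01-03")
res
```

The cumulative columns are computed before the filter, as in the question's test pipeline, so the running totals still include the days that are filtered out afterwards.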

Difference between double brackets and the $ sign

Since I have to use a function in a loop, I have to use the dplyr group_by_at() and summarise_at() functions. Unfortunately, I am not able to use the complete() function (which comes from tidyr, not plyr) to keep empty groups when I select columns by index. Or is there another option to prevent dplyr from dropping empty groups?
library(dplyr)
library(tidyr)
df1 <- mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE)) %>%
complete(gear, fill = list(Gewicht = 1))
df1
df2 <- mtcars %>%
group_by_at(10) %>%
summarise_at(1, mean, na.rm = TRUE) %>%
complete(gear, fill = list(Gewicht = 1))
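One option for keeping empty groups, sketched here under the assumption that the grouping column can be turned into a factor whose levels include the missing values, is the .drop = FALSE argument that group_by() and group_by_at() accept since dplyr 0.8.0. The extra level 6 below is made up for illustration, since mtcars has no cars with 6 gears.

```r
library(dplyr)

# Make gear a factor with an extra, unobserved level (6 is hypothetical)
cars <- mtcars %>%
  mutate(gear = factor(gear, levels = c(3, 4, 5, 6)))

# Column 10 is gear; .drop = FALSE keeps the group for the empty level.
# Column 1 (mpg) is summarised, so the empty group gets NaN.
res <- cars %>%
  group_by_at(10, .drop = FALSE) %>%
  summarise_at(1, mean, na.rm = TRUE)
res
```

The empty group appears as a fourth row whose mean is NaN, which can then be replaced with a fill value if needed.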

Resources