Change iteration number in one Dplyr command - r

I need to be able to change the iteration number in each separate line of one dplyr pipeline.
I have prepared an example of the 'by hand' approach and of what I need to do in 'pseudo steps'.
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is 'by-hand' approach of what I would like to do - How to automate it? E.g. we do not know all values of 'carb'
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
Is there any trick, maybe using the purrr package, or some other solution?
I need to iterate over the unique values of a variable and create a new column in the data frame for each unique value. I need to automate this, but I am not able to do so on my own.

You can do this using sym to convert text to symbols and !! to evaluate within dplyr functions. See this question and this vignette for further details.
For your application, you probably want something like this:
carbs = c(1,2,4)
for(cc in carbs){
var_name = sym(paste0("VARIABLE",cc))
cr$data[[1]] = cr$data[[1]] %>%
mutate(!!var_name := case_when(carb == cc ~ hp/mpg,
TRUE ~ 0))
}
There are three key parts to this:
sym turns the text string into a symbol (a variable name).
!! unquotes that symbol, so mutate uses the name it holds rather than the literal name var_name.
:= replaces = because R's syntax does not allow !! on the left-hand side of =.
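The loop above hard-codes the carb values. As a minimal sketch of how the same idea could be automated, the values can be read from the data and the columns added with purrr::reduce. The names df, vals and out are placeholders chosen for this illustration, and if_else stands in for the two-branch case_when:

```r
library(dplyr)
library(purrr)

df <- mtcars

# Derive the values to iterate over from the data instead of hard-coding them
vals <- sort(unique(df$carb))

# Start from df and add one VARIABLE<v> column per unique carb value
out <- reduce(
  vals,
  function(acc, v) {
    mutate(acc, !!paste0("VARIABLE", v) := if_else(carb == v, hp / mpg, 0))
  },
  .init = df
)

names(out)
```

To apply this to every nested data frame, the reduce call could be wrapped in a function and passed to purrr::map over cr$data.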

Related

default arguments not being recognized in custom function using dplyr

Take this function foo(). I want it to have a default argument of cyl because that's the name of the field it will usually process.
library(tidyverse)
foo <- function(x = cyl){
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
# works:
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
But I am surprised that the function will not work unless I explicitly supply the default argument. See failing code below
# fails:
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
Error: Problem with `mutate()` column `cyl_refactor`.
ℹ `cyl_refactor = foo()`.
x object 'cyl' not found
It seems that default arguments are only processed when there is also a data parameter as below.
foo2 <- function(data, x = cyl){
data %>%
mutate(cyl_refactor = case_when(
{{x}} == 6 ~ TRUE,
{{x}} == 8 ~ FALSE,
{{x}} == 4 ~ NA
))
}
mtcars %>%
foo2() %>%
select(cyl, cyl_refactor)
I am sure there is some gap in my knowledge of quasiquotation, but I would like to understand how to use a default argument in foo().
Here's one that will "work", though I wouldn't recommend it:
foo <- function(x = cyl){
x <- enquo(x)
eval.parent(rlang::quo_squash(rlang::quo(case_when(
!!x == 6 ~ TRUE,
!!x == 8 ~ FALSE,
!!x == 4 ~ NA
))))
}
# Both run without error
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
The problem is that in order for case_when to work, you can't just pass in a column name without also passing in the data. In order to "find" the data in this case, I've used eval.parent() to go up the call chain to try to find the cyl variable.
It's better to make proper functions where you pass in the input data directly (rather than variable names they need to look up themselves).
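The behaviour follows from how R evaluates default arguments: they are evaluated lazily inside the function's own environment (and then its enclosing environments), not in the data mask that mutate sets up for the expressions written in its call. The following base R sketch illustrates that scoping rule; f, g and not_defined_anywhere are hypothetical names used only for this illustration:

```r
# The default `y` is looked up inside f's own environment when x is first used,
# so it finds the y assigned in the function body.
f <- function(x = y) {
  y <- 10
  x
}
result <- f()

# Here the default is not defined in g's environment or in any enclosing
# environment, so evaluating it fails, just as `cyl` does in foo().
g <- function(x = not_defined_anywhere) x
failed <- inherits(try(g(), silent = TRUE), "try-error")

result
failed
```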
We could do this with missing() and cur_data_all() (note that cur_data_all() is soft-deprecated since dplyr 1.1.0 in favour of pick()):
foo <- function(x = cyl){
if(missing(x)) x <- cur_data_all()[["cyl"]]
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
Testing:
> out1 <- mtcars %>%
+ mutate(cyl_refactor = foo(cyl)) %>%
+ select(cyl, cyl_refactor)
> out2 <- mtcars %>%
+ mutate(cyl_refactor = foo()) %>%
+ select(cyl, cyl_refactor)
>
> identical(out1, out2)
[1] TRUE

Conditional column sort

I have to sort one column in my df by checking a condition on a string.
Basically, I want to look at test.name and arrange the column in ascending or descending order depending on the value it contains.
In the example below, I tried with paste0 after the pipe, but something is not working.
test.name <- "abc"
test.value <- data.frame(a = rnorm(100, 0, 1)
, b = rnorm(100, 0, 1))
result <- case_when(test.name == "bcd" ~ "desc"
, TRUE ~ "asc")
paste0("arrange(",result,"(b))",sep="")
test.value %>% paste0("arrange(",result,"(b))",sep="")
We could use parse_expr from rlang and evaluate (!!)
library(dplyr)
test.value %>%
arrange(!! rlang::parse_expr(case_when(test.name == 'bcd'~
'desc(b)', TRUE ~ 'b')))
Or we can use across as well
test.value %>%
arrange(across(b, ~ case_when(test.name == 'bcd' ~ desc(.), TRUE ~.)))
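Because the condition is a single value rather than a column, a plain if/else inside arrange may be enough. This is only a sketch under that assumption; sort_by_b is a hypothetical helper name, and set.seed is added solely to make the example reproducible:

```r
library(dplyr)

set.seed(1)
test.value <- data.frame(a = rnorm(100, 0, 1),
                         b = rnorm(100, 0, 1))

# Sort by b, descending only when test.name is "bcd"
sort_by_b <- function(df, test.name) {
  df %>% arrange(if (test.name == "bcd") desc(b) else b)
}

asc <- sort_by_b(test.value, "abc")
dsc <- sort_by_b(test.value, "bcd")

head(asc$b)
head(dsc$b)
```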

Automate Dplyr's mutate function

What is the best way to automate the mutate function within one dplyr chain?
It is best if I demonstrate with an example.
In the first part of the example I create new columns based on the values of the variable gear. However, imagine I need to automate this step so that it 'iterates' over all unique values of gear and creates a new column for each value.
Is there any way to do so?
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is 'by-hand' approach of what I would like to do - How to automate it? E.g. we do not know all values of 'carb'
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
One way would be to use dummy_cols from package fastDummies
Doing it for one dataframe at a time:
library(fastDummies)
cr$data[[1]] %>%
dummy_cols(select_columns = 'carb') %>%
# funs() is deprecated; a lambda formula does the same job
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg)
You can also do this first and then group by gear: since the gear value is not used in the calculation, the order does not matter. For that:
cr_new = mtcars %>%
dummy_cols(select_columns = 'carb') %>%
# funs() is deprecated; a lambda formula does the same job
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg) %>%
group_by(gear) %>%
nest()
Perhaps, something like this would help -
library(dplyr)
library(purrr)
bind_cols(mtcars, map_dfc(unique(mtcars$carb),
~mtcars %>%
# compare against the current value .x rather than the constant 1
transmute(!!paste0('carb', .x) := case_when(carb == .x ~ hp/mpg, TRUE ~ 0))))
It sounds a lot like what's called "the XY-problem".
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Please read about tidy data, and/or tidyr's pivot_longer/pivot_wider. Column names should not encode information.
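For comparison, here is a rough sketch of how tidyr's pivot_wider could generate the per-value columns from a single long-format ratio column, without listing the carb values by hand. The helper columns id, ratio and carb_key are assumptions made for this illustration, not part of the original question:

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  mutate(
    id = row_number(),   # keeps every original row distinct
    ratio = hp / mpg,    # the value to spread across columns
    carb_key = carb      # copy of carb, consumed as the column key
  ) %>%
  pivot_wider(
    names_from = carb_key,
    values_from = ratio,
    names_prefix = "VARIABLE",
    values_fill = 0      # rows that do not match a given carb value get 0
  ) %>%
  arrange(id)

names(wide)
```

Keeping only the long form (one ratio column alongside carb) is the tidy alternative the comment points to; the wide form is shown because it reproduces the columns asked for.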

Conditional formatting in expss tables

My question can be considering an extension of the following discussion: R expss package: format numbers by statistic / apply different format to alternate rows
I would like to understand the grammar of conditions so that I can write my own custom formats. Consider the 'infert' data frame from datasets. We create the following table with expss:
infert %>%
tab_cells(parity) %>%
### TOTAL
tab_cols(total()) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
### OTHER VARIABLES
tab_cols(education) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
tab_stat_cpct(label="%Col.", total_row_position="none") %>%
tab_pivot(stat_position="inside_columns") %>%
format_vert()
The last line applies basic formatting, as discussed in the URL above. In detail:
format_vert = function(tbl, pct_digits=2, n_digits=0){
#Finding columns to format
pct_cols = grepl("\\|%Col.$", names(tbl), perl = TRUE)
n_cols = grepl("\\|N$", names(tbl), perl = TRUE)
#Format
recode(tbl[,-1]) = other ~ function(x) ifelse(is.numeric(x) & is.na(x), 0, x)
tbl[,pct_cols] = format(tbl[,pct_cols], digits=pct_digits, nsmall=pct_digits)
tbl[,n_cols] = format(tbl[,n_cols], digits=n_digits, nsmall=n_digits)
recode(tbl[,pct_cols]) = other ~ function(x) paste0(x, "%")
tbl
}
I understand how to format whole tables or columns (experts would have noticed the differences vs. the example in the URL), but what if I only wish to format specific cells? For instance, how to set digits=0 when value = 100,00% (to only show 100%) ?
I don't know if I should go for recode, format, when and where to reference tbl[,pct_cols]...
Thank you!
The simplest way is to add extra recodings to the recode call in format_vert. We can't use a recoding of the form '100.00' ~ '100' because the column values have already been padded with spaces for alignment, so we use regular expressions instead. perl means a Perl-style regex comparison and \\b means a word boundary. Every value that matches such an expression is recoded.
data(infert)
format_vert = function(tbl, pct_digits=2, n_digits=0){
#Finding columns to format
pct_cols = grepl("\\|%Col.$", names(tbl), perl = TRUE)
n_cols = grepl("\\|N$", names(tbl), perl = TRUE)
#Format
recode(tbl[,-1]) = other ~ function(x) ifelse(is.numeric(x) & is.na(x), 0, x)
tbl[,pct_cols] = format(tbl[,pct_cols], digits=pct_digits, nsmall=pct_digits)
tbl[,n_cols] = format(tbl[,n_cols], digits=n_digits, nsmall=n_digits)
recode(tbl[,pct_cols]) = c(
perl("\\b0.00\\b") ~ "0% ", # additional recodings
perl("\\b100.00\\b") ~ "100% ", # additional recodings
other ~ function(x) paste0(x, "%")
)
tbl
}
infert %>%
tab_cells(parity) %>%
### TOTAL
tab_cols(total()) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
### OTHER VARIABLES
tab_cols(education) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
tab_stat_cpct(label="%Col.", total_row_position="none") %>%
tab_pivot(stat_position="inside_columns") %>%
format_vert()
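The word-boundary matching used in the recodings above can be checked in isolation with base R. This sketch uses a hand-written vector x of space-padded strings as a stand-in for a formatted column, and it escapes the dot (\\.) so that it matches only a literal decimal point, which is a small tightening of the patterns shown above:

```r
# Space-padded strings, as format() would produce for an aligned column
x <- c("  0.00", " 12.50", "100.00")

# \\b stops "0.00" from matching inside "100.00": there is no word
# boundary between the two zeros, so only the first element matches.
is_zero <- grepl("\\b0\\.00\\b", x, perl = TRUE)

# Matches only the value that is exactly 100.00
is_hundred <- grepl("\\b100\\.00\\b", x, perl = TRUE)

is_zero
is_hundred
```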

dplyr self-join with filter

I want to subtract values from a row with label "baseline" from all the otherwise marked items in a long format data frame. It is easy to do this in two steps using a left_join with the "baseline" subset. However, I could not figure out how to combine vas_1 and vas_diff into one chain.
library(dplyr)
# Create test data
n_users = 5
vas = tibble( # data_frame() is deprecated in favour of tibble()
user = rep(letters[1:n_users], each = 3),
group = rep(c("baseline", "early", "late" ),n_users),
vas = round(rgamma(n_users*3, 10,1.4 ))
)
# The above data are given
# Assume some other operations are required
vas_1 = vas %>%
mutate(
vas = vas * 2
)
# I want to put the following into one
# chain with the above
# Use self-join to subtract baseline
vas_diff = vas_1 %>%
filter(group != "baseline") %>%
# Problem is vas_1 here. Using . gives error here
# Adding copy = TRUE does not help
# left_join(. %>% filter(group == "baseline") , by = c("user")) %>%
left_join(vas_1 %>% filter(group == "baseline") , by = c("user")) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas) # remove temporary variables
vas_diff
I use anonymous function when . should be used multiple times:
... %>% (function(df) { ... }) %>% ...
Hence, in your case:
vas_diff = vas_1 %>%
filter(group != "baseline") %>%
(function(df) left_join(df, df %>% filter(group == "baseline") , by = c("user"))) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas)
(This will not produce the desired result, because df has already had its baseline rows filtered out, as described in the comments above, but it shows how to use an anonymous function.)
You probably want this instead:
vas_diff = vas_1 %>%
left_join(
x = filter(., group != "baseline")
, y = filter(., group == "baseline")
, by = c("user")
) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas) # remove temporary variables
Here's a similar option, and a demonstration that you can pass whole pipe chains as arguments to the join. Instead of moving the . inside filter, you can pass . as an argument to eval, then remove unwanted columns in the right-hand side. This is largely just to document this approach for my own purposes.
vas_diff = vas_1 %>%
left_join(x = eval(.) %>%
filter(group != "baseline"),
y = eval(.) %>%
filter(group == "baseline") %>%
select(-group),
by = c("user")) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group, vas)
Does anybody know why you can't simply pass the ., as in x = . %>% filter ...? Why do we need eval(.)?
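One likely explanation is magrittr's documented behaviour that a pipeline starting with a bare . (such as . %>% filter(...)) builds a function rather than evaluating anything, so the join receives a function instead of a data frame. The sketch below illustrates this and shows the brace form, which uses . several times without wrapping it in eval. The small fixed data set and the _baseline suffix are assumptions chosen for this illustration:

```r
library(dplyr)

vas_1 <- tibble(
  user  = rep(letters[1:3], each = 3),
  group = rep(c("baseline", "early", "late"), 3),
  vas   = c(10, 12, 15, 8, 9, 7, 20, 18, 25)
)

# A pipeline that starts with `.` is a function, not a data frame
f <- . %>% filter(group == "baseline")
is.function(f)

# Inside braces, `.` is not inserted as the first argument and can be reused
vas_diff <- vas_1 %>%
  {
    left_join(
      filter(., group != "baseline"),
      filter(., group == "baseline") %>% select(-group),
      by = "user",
      suffix = c("", "_baseline")
    )
  } %>%
  mutate(vas = vas - vas_baseline) %>%  # subtract each user's baseline
  select(user, group, vas)

vas_diff
```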
