How to use an if/else statement in a dplyr chain in R

My question is: how do I use an if/else statement in a dplyr chain?
For example:
select.vars <- function(data, price=TRUE ){
diamonds %>% {if (price) select(price) else select(carat)}
}
select.vars(diamonds)
I got this error:
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "logical"
This function is meaningless; it is just for illustration.
Thanks a lot.

We can use if/else inside select():
select.vars <- function(data, price = TRUE) {
  data %>%
    select(if (price) "price" else "carat")
}
resprice <- select.vars(diamonds)
rescarat <- select.vars(diamonds, FALSE)
head(rescarat)
# A tibble: 6 x 1
# carat
# <dbl>
#1 0.23
#2 0.21
#3 0.23
#4 0.29
#5 0.31
#6 0.24
head(resprice)
# A tibble: 6 x 1
# price
# <int>
#1 326
#2 326
#3 327
#4 334
#5 335
#6 336

I just figured it out. Simply pass . as the first argument of each select() call:
select.vars <- function(data, price = TRUE) {
  data %>% {if (price) select(., price) else select(., carat)}
}
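The reason the dot is needed can be sketched as follows. When the right-hand side of %>% is wrapped in braces, magrittr does not insert the left-hand side as the first argument, so it has to be passed explicitly as `.`. The helper name select_vars, the flag name use_price, and the toy data frame are assumptions made for this illustration only.

```r
library(dplyr)

# Hypothetical helper; `use_price` is an illustrative flag name.
select_vars <- function(data, use_price = TRUE) {
  # Inside { }, the piped value is only available as `.`;
  # it is not inserted as the first argument of select().
  data %>% {
    if (use_price) select(., price) else select(., carat)
  }
}

# Toy data standing in for diamonds.
toy <- data.frame(price = c(326, 327), carat = c(0.23, 0.21), depth = c(61.5, 59.8))
names(select_vars(toy))
names(select_vars(toy, use_price = FALSE))
```

Naming the flag differently from the column avoids any ambiguity about whether price refers to the argument or to the column.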

Related

What is the correct way to use dplyr's slice_sample() within my apply function?

In the code below, I've simulated dice rolls at increasing sample sizes and computed the average roll at each sample size. My lapply function works, but I'm uncomfortable with it, since I know sample_n has been superseded by slice_sample. I would like to improve my code by using slice_sample() instead of sample_n() within the lapply. I think I may have other syntactical errors within the lapply. Here is the code:
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
The final step is computing the difference compared to the expected value, 3.5. I want a column that shows the difference between 3.5 and the sample mean. We should see the difference decreasing as the sample size increases.
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it's throwing this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I've tried using sapply but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
If it helps, here was my failed attempt at using slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
Each element of output is a single-row data.frame, and output itself is a list, which is why mutate fails. We can bind the elements with bind_rows and then subtract once on the combined data frame instead of once per element:
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
The n argument of slice_sample corresponds to sample_n's size argument.
And to calculate the difference of your output list we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a column diff inside each tibble, as proposed by Martin Gal in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
Here is a base R way -
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you just need the difference value you can do -
3.5 - sapply(output, `[[`, 'sample_mean')
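If the intermediate list of one-row data frames is not needed, the whole simulation can also be sketched in base R as a single data frame built from a numeric vector. This is only an illustration under assumptions: it samples with sample() directly rather than with slice_sample(), and it uses a shorter, hypothetical set of sample sizes so that it runs quickly.

```r
set.seed(123)

# Shorter, illustrative set of sample sizes.
sample_sizes <- c(10, 100, 1000, 10000)

# One mean per sample size; vapply guarantees a plain numeric vector.
sample_mean <- vapply(
  sample_sizes,
  function(n) mean(sample(1:6, size = n, replace = TRUE)),
  numeric(1)
)

# Build the result in one step, including the difference from 3.5.
result <- data.frame(
  var = sample_sizes,
  sample_mean = sample_mean,
  difference = 3.5 - sample_mean
)
result
```

Because the result is already a data frame, no bind_rows or do.call(rbind, ...) step is required.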

using `rlang` quasiquotation with `dplyr::_join` functions

I am trying to write a custom function where I use rlang's quasiquotation. This function also internally uses dplyr's join functions. I have provided below a minimal working example that illustrates my problem.
# needed libraries
library(tidyverse)
# function definition
df_combiner <- function(data, x, group.by) {
# check how many variables were entered for this grouping variable
group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))
# based on number of arguments, select `group.by` in cases like `c(cyl)`,
# the first list element after `quo_squash` will be `c` which we don't need,
# but if we pass just `cyl`, there is no `c`, this will take care of that
# issue
group.by <-
if (length(group.by) == 1) {
group.by
} else {
group.by[-1]
}
# creating internal dataframe
df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)
# creating dataframes to be joined: one with tally, one with summary
df_tally <- dplyr::tally(df)
df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))
# without specifying `by` argument, this works but prints a message I want to avoid
print(dplyr::left_join(x = df_tally, y = df_mean))
# joining by specifying `by` argument (my failed attempt)
dplyr::left_join(x = df_tally, y = df_mean, by = !!!group.by)
}
# using the function
df_combiner(diamonds, carat, c(cut, clarity))
#> Joining, by = c("cut", "clarity")
#> # A tibble: 40 x 4
#> # Groups: cut [5]
#> cut clarity n mean
#> <ord> <ord> <int> <dbl>
#> 1 Fair I1 210 1.36
#> 2 Fair SI2 466 1.20
#> 3 Fair SI1 408 0.965
#> 4 Fair VS2 261 0.885
#> 5 Fair VS1 170 0.880
#> 6 Fair VVS2 69 0.692
#> 7 Fair VVS1 17 0.665
#> 8 Fair IF 9 0.474
#> 9 Good I1 96 1.20
#> 10 Good SI2 1081 1.04
#> # ... with 30 more rows
#> Error in !group.by: invalid argument type
As can be seen here, I want to avoid the message #> Joining, by = c("cut", "clarity") and so explicitly want to input the by argument for the _join function but I am not sure how to do this.
(I've tried rlang::as_string, rlang::quo_name, etc.).
We can convert the grouping symbols to strings with rlang::as_string:
dplyr::left_join(x = df_tally, y = df_mean,
by = map_chr(group.by, rlang::as_string))
df_combiner <- function(data, x, group.by) {
# check how many variables were entered for this grouping variable
group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))
# based on number of arguments, select `group.by` in cases like `c(cyl)`,
# the first list element after `quo_squash` will be `c` which we don't need,
# but if we pass just `cyl`, there is no `c`, this will take care of that
# issue
group.by <-
if (length(group.by) == 1) {
group.by
} else {
group.by[-1]
}
# creating internal dataframe
df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)
# creating dataframes to be joined: one with tally, one with summary
df_tally <- dplyr::tally(df)
df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))
# joining with an explicit `by` argument: convert the grouping symbols to strings
dplyr::left_join(x = df_tally, y = df_mean, by = purrr::map_chr(group.by, rlang::as_string))
}
Checking:
df_combiner(diamonds, carat, c(cut, clarity))
# A tibble: 40 x 4
# Groups: cut [5]
# cut clarity n mean
# <ord> <ord> <int> <dbl>
# 1 Fair I1 210 1.36
# 2 Fair SI2 466 1.20
# 3 Fair SI1 408 0.965
# 4 Fair VS2 261 0.885
# 5 Fair VS1 170 0.880
# 6 Fair VVS2 69 0.692
# 7 Fair VVS1 17 0.665
# 8 Fair IF 9 0.474
# 9 Good I1 96 1.20
#10 Good SI2 1081 1.04
# … with 30 more rows
Join functions take a string vector for their by argument. Use deparse to go from expressions to strings:
dplyr::left_join(x = df_tally, y = df_mean, by = map_chr(group.by, deparse))
As mentioned by earlier authors, `by` expects a string vector. An easy way to move from lists of quosures to strings is illustrated by stanwood on the RStudio Community thread "Should tidyeval be abandoned?":
...tidyr::left_join still expects a list of strings: by = c("Species",
"Sepal.Length"). If I want to supply these programatically the best
solution I found was by = sapply(sepaldims, quo_text). Consider this a
plug for abstracting quo_text to lists of quosures.
sepaldims <- quos(Species, Sepal.Length)
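The idea in the quote can be sketched as follows: turn each quosure into a column name, then pass the resulting character vector to by. This sketch uses rlang::as_label in place of the quo_text call from the quote, and the two small data frames a and b are made up for the illustration.

```r
library(dplyr)
library(rlang)

sepaldims <- quos(Species, Sepal.Length)

# One string per quosure; unname() drops the empty names that quos() carries.
by_cols <- unname(vapply(sepaldims, as_label, character(1)))

# Hypothetical toy tables sharing the two key columns.
a <- data.frame(Species = c("setosa", "virginica"),
                Sepal.Length = c(5.1, 6.3),
                n = c(10L, 20L))
b <- data.frame(Species = c("setosa", "virginica"),
                Sepal.Length = c(5.1, 6.3),
                mean = c(1.5, 2.5))

# Supplying `by` explicitly suppresses the "Joining, by = ..." message.
joined <- left_join(a, b, by = by_cols)
joined
```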

Optional argument programming with dplyr

I'm new to programming with dplyr. Let's say I have a function like this :
example <- function(data, group, var){
group <- enquo(group)
var <- enquo(var)
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
}
> diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
I'd like to add a new argument sort. If it is not specified, the function does nothing more; if it is specified, the function also calls arrange(desc(sort)).
I tried things like this :
example <- function(data, group, var, sort = NULL){
sort <- enquo(sort)
group <- enquo(group)
var <- enquo(var)
data <-
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
if(is.null(sort)) data
else arrange(data, desc(!! sort))
}
It works when the sort argument is specified
> diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
But if I leave sort blank, I get this error message:
Error: cannot arrange column of class 'NULL' at position 1
How can I fix this? Thanks for your help, and sorry for my bad English.
Moving sort <- enquo(sort) into the else branch gets you closer, but a problem remains: if you call the function with sort = Min, is.null(sort) tries to evaluate Min, does not find it, and returns an error.
Instead of using is.null, you could use missing, which does not evaluate the Min object.
example <- function(data, group, var, sort = NULL){
group <- enquo(group)
var <- enquo(var)
data <-
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
if (missing(sort)) {
data
} else {
sort <- enquo(sort)
arrange(data, desc(!! sort))
}
}
Now we get the following outputs:
diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
See Testing a function that uses enquo() for a NULL parameter
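An alternative to missing() is to capture the argument first and then test the captured quosure with rlang::quo_is_null(). The sketch below is a hedged variant, not the code from the answer above: the function name example_sorted and the toy data frame are assumptions, and it uses the {{ }} operator in place of enquo() plus !! for the group and var arguments.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of example(); `sort` defaults to NULL.
example_sorted <- function(data, group, var, sort = NULL) {
  sort <- enquo(sort)  # capture without evaluating, so `Min` is never looked up
  out <- data %>%
    group_by({{ group }}) %>%
    summarise(Min = min({{ var }}), Max = max({{ var }}), .groups = "drop")
  # quo_is_null() is TRUE when the captured expression is NULL (the default).
  if (quo_is_null(sort)) out else arrange(out, desc(!!sort))
}

# Toy data for illustration.
toy <- data.frame(g = c("a", "a", "b", "b"), v = c(1, 5, 2, 3))
example_sorted(toy, g, v)
example_sorted(toy, g, v, sort = Min)
```

Unlike missing(), this check also treats an explicit sort = NULL the same way as an omitted argument.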

Pass multiple columns in dataframe into function at once in R

After much searching, I can't seem to figure this out.
Trying to write a function that:
- takes a data frame, db
- groups the data frame by var1
- returns the mean and sd by group on several different columns
Here is my function,
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
for (i in var2) {
db %>%
group_by(!!var1) %>%
summarise(mean_var = mean(!!!var2))
}}
When I call it as follows, nothing is returned:
myfun(data, group, age, bmi)
Ideally, I would like to group both age and bmi by group and return the mean and sd for each. In the future, I would like to pass many more columns from data into the function...
The output would be similar to summaryBy from the doBy package, but on many columns at once, and would look like:
Group age.mean age.sd
0
1
bmi.mean bmi.sd
0
1
Your loop appears to be unnecessary (you aren't doing anything with i). Instead, you could use summarize_at to achieve the effect you want:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd))
}
And if we test it out with diamonds dataset:
myfun(diamonds, cut, x, z)
cut x_mean z_mean x_sd z_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Fair 6.25 3.98 0.964 0.652
2 Good 5.84 3.64 1.06 0.655
3 Very Good 5.74 3.56 1.10 0.730
4 Premium 5.97 3.65 1.19 0.731
5 Ideal 5.51 3.40 1.06 0.658
To get the formatting closer to what you had in mind in your original post, we can use a bit of tidyr magic:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd)) %>%
gather(variable, value, -(!!var1)) %>%
separate(variable, c('variable', 'measure'), sep = '_') %>%
spread(measure, value) %>%
arrange(variable, !!var1)
}
cut variable mean sd
<ord> <chr> <dbl> <dbl>
1 Fair x 6.25 0.964
2 Good x 5.84 1.06
3 Very Good x 5.74 1.10
4 Premium x 5.97 1.19
5 Ideal x 5.51 1.06
6 Fair z 3.98 0.652
7 Good z 3.64 0.655
8 Very Good z 3.56 0.730
9 Premium z 3.65 0.731
10 Ideal z 3.40 0.658
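In current dplyr, summarise_at() is superseded by across(). A minimal sketch of the same function using across() is shown below, assuming dplyr 1.0.0 or later. The name myfun_across and the toy data frame are illustrative assumptions, and the dots are forwarded to across() by wrapping them in c().

```r
library(dplyr)

# Hypothetical across()-based variant of myfun().
myfun_across <- function(db, var1, ...) {
  db %>%
    group_by({{ var1 }}) %>%
    # c(...) forwards the selected columns; the named list sets the suffixes.
    summarise(across(c(...), list(mean = mean, sd = sd)), .groups = "drop")
}

# Toy data for illustration.
toy <- data.frame(g = c("a", "a", "b", "b"),
                  x = c(1, 3, 2, 4),
                  z = c(10, 20, 30, 50))
myfun_across(toy, g, x, z)
```

Note that across() orders the result columns by variable and then by function (x_mean, x_sd, z_mean, z_sd), whereas the summarise_at() output above is ordered by function first.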

ifelse() nested statements in summarize function in dplyr R

I am trying to summarise a dataframe based on grouping by label column. I want to obtain means based on the following conditions:
- if all numbers are NA - then I want to return NA
- if mean of all the numbers is 1 or lower - I want to return 1
- if mean of all the numbers is higher than 1 - I want a mean of the values in the group that are greater than 1
- all the rest should be 100.
Managed to find the answer and now my code is running well - is.na() should be there instead of ==NA in the first ifelse() statement and that was the issue.
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)
Data2 <- data.frame(label,sev)
d <- Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.na(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
Your first condition is the issue here. If we remove the nested ifelse and keep only the first one, we get the same output
Data2 %>%
group_by(label) %>%
summarise(sevmean = ifelse(mean(sev,na.rm=TRUE)==NaN,NA,1))
# label sevmean
# <dbl> <lgl>
#1 1.00 NA
#2 2.00 NA
#3 3.00 NA
#4 4.00 NA
#5 5.00 NA
#6 6.00 NA
#7 7.00 NA
I am not sure why you are checking for NaN, but if you want to do that, use is.nan() instead of ==. Any comparison with NaN using == returns NA, so the first condition is never TRUE:
Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.nan(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
# label sevmean
# <dbl> <dbl>
#1 1.00 NA
#2 2.00 1.00
#3 3.00 1.00
#4 4.00 2.00
#5 5.00 3.67
#6 6.00 1.00
#7 7.00 4.50
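Because each group is reduced to a single value, the nested ifelse() calls can also be written as ordinary if statements in a small helper. The base R sketch below reuses the label and sev vectors from the question; the helper name sev_summary is an assumption. It also shows why the == comparison fails. Once the NaN case is handled, the group mean is either at most 1 or greater than 1, so the final "100" branch can never be reached and is omitted here.

```r
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)

# Comparing with == never detects NaN: the result is NA, not TRUE.
NaN == NaN
is.nan(NaN)

# Hypothetical helper implementing the rules for one group.
sev_summary <- function(sev) {
  m <- mean(sev, na.rm = TRUE)       # NaN when every value is NA
  if (is.nan(m)) return(NA_real_)    # all values missing
  if (m <= 1) return(1)              # mean of 1 or lower
  mean(sev[sev > 1], na.rm = TRUE)   # mean of the values greater than 1
}

# Apply the helper to each label group.
sevmean <- vapply(split(sev, label), sev_summary, numeric(1))
sevmean
```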

Resources