dplyr::group_by_ with character string input of several variable names

I'm writing a function where the user is asked to define one or more grouping variables in the function call. The data is then grouped using dplyr and it works as expected if there is only one grouping variable, but I haven't figured out how to do it with multiple grouping variables.
Example:
x <- c("cyl")
y <- c("cyl", "gear")
dots <- list(~cyl, ~gear)
library(dplyr)
library(lazyeval)
mtcars %>% group_by_(x) # groups by cyl
mtcars %>% group_by_(y) # groups only by cyl (not gear)
mtcars %>% group_by_(.dots = dots) # groups by cyl and gear, this is what I want.
I tried to turn y into the same as dots using:
mtcars %>% group_by_(.dots = interp(~var, var = list(y)))
#Error: is.call(expr) || is.name(expr) || is.atomic(expr) is not TRUE
How can I use a user-supplied character vector of more than one variable name (like y in the example) to group the data using dplyr?
(This question is somewhat related to this one but not answered there.)

No need for interp here, just use as.formula to convert the strings to formulas:
dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
The reason why your interp approach doesn’t work is that the expression gives you back the following:
~list(c("cyl", "gear"))
– not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:
dots1 = sapply(y, . %>% {interp(~var, var = .)})
But, in fact, you can also directly pass y:
mtcars %>% group_by_(.dots = y)
The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.

slice_rows() from the purrrlyr package (https://github.com/hadley/purrrlyr) groups a data.frame by taking a vector of column names (strings) or positions (integers):
y <- c("cyl", "gear")
mtcars_grp <- mtcars %>% purrrlyr::slice_rows(y)
class(mtcars_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
group_vars(mtcars_grp)
#> [1] "cyl" "gear"
Particularly useful now that group_by_() has been deprecated.
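Since group_by_() is deprecated, the same grouping by a character vector can also be written with the non-underscore verbs. The following is a minimal sketch, assuming dplyr 1.0.0 or later for across() and all_of(); the object names grp_across and grp_syms are only illustrative.

```r
library(dplyr)

y <- c("cyl", "gear")

# Option 1: select the grouping columns by name with a tidyselect helper
# (assumes dplyr >= 1.0.0, where across() is available)
grp_across <- mtcars %>% group_by(across(all_of(y)))

# Option 2: turn the strings into symbols and splice them into group_by()
# (assumes dplyr >= 0.7.0 with tidy evaluation)
grp_syms <- mtcars %>% group_by(!!!rlang::syms(y))

group_vars(grp_across)
group_vars(grp_syms)
```

Both calls should report cyl and gear as the grouping variables. Option 1 has the advantage that all_of() raises an error when a requested column does not exist.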

Related

Changing factors order inside a function [duplicate]

I have been reading from this SO post on how to work with string references to variables in dplyr.
I would like to mutate an existing column based on string input:
var <- 'vs'
my_mtcars <- mtcars %>%
mutate(get(var) = factor(get(var)))
Error: unexpected '=' in:
"my_mtcars <- mtcars %>%
mutate(get(var) ="
Also tried:
my_mtcars <- mtcars %>%
mutate(!! rlang::sym(var) = factor(!! rlang::symget(var)))
This resulted in the exact same error message.
How can I do the following based on passing string 'vs' within var variable to mutate?
# works
my_mtcars <- mtcars %>%
mutate(vs = factor(vs))
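This can be done with the := operator: on the left-hand side, unquote the string with !! to set the column name; on the right-hand side, convert the string to a symbol with rlang::sym() and unquote it with !! so that it is evaluated as a column: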
library(dplyr)
my_mtcars <- mtcars %>%
  mutate(!! var := factor(!! rlang::sym(var)))
class(my_mtcars$vs)
#[1] "factor"
Alternatively, use mutate_at(), which accepts strings in vars() and applies the function of interest:
my_mtcars2 <- mtcars %>%
mutate_at(vars(var), factor)
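In recent dplyr releases mutate_at() is superseded by across(). A minimal sketch of the same conversion, assuming dplyr 1.0.0 or later (the name my_mtcars3 is only illustrative):

```r
library(dplyr)

var <- "vs"

# all_of() looks up the column named by the string stored in `var`;
# across() then applies factor() to that column in place.
my_mtcars3 <- mtcars %>%
  mutate(across(all_of(var), factor))

class(my_mtcars3$vs)
```

Because var is a character vector, the same call works unchanged when it holds several column names.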

Replacement of the dot function from plyr

How can I transform a vector of groups specified using the plyr function . such as .(group, sex) into a vector of characters like this: c("group", "sex")?
We used the plyr approach to specify the groups in an older version of our R package. In the new version we want the user to specify the groups using a vector of strings, but we do not want to break previous code that used the dot approach.
Example of the old function:
library(plyr)
my_function_old <- function(df, grouping) {
  ddply(df, grouping, summarize,
        m = mean(mpg))
}
my_function_old(mtcars, .(cyl, vs))
Example of the new function:
library(dplyr)
my_function_new <- function(df, grouping) {
  df %>%
    group_by(!!!syms(grouping)) %>%
    summarise(m = mean(mpg))
}
my_function_new(mtcars, c("cyl", "vs"))
In the new function the grouping should be specified using a vector of strings. I would like to check whether the user is using the old dot notation in the new function and in that case to transform the grouping variables specified with the dot to a vector of strings.
Using enexpr
library(dplyr)
my_function <- function(df, grouping) {
  grouping <- as.character(enexpr(grouping))[-1]
  df %>%
    group_by(!!!syms(grouping)) %>%
    summarise(m = mean(mpg))
}
my_function(mtcars, c("cyl", "vs")) # this works
my_function(mtcars, .(cyl, vs)) # this also works
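One caveat of capturing the argument with enexpr(): it only sees the expression as typed, so a character vector stored in a variable (for example grp <- c("cyl", "vs")) would be reduced to an empty grouping. A possible variant, sketched here under the assumption that the old notation always appears literally as .(...) in the call, checks for the dot call explicitly and otherwise evaluates the argument as usual. The name my_function_compat is hypothetical.

```r
library(dplyr)
library(rlang)

# Hypothetical helper name; a sketch, not the package's actual API.
my_function_compat <- function(df, grouping) {
  expr <- enexpr(grouping)
  if (is.call(expr) && identical(expr[[1]], quote(.))) {
    # Old dot notation: read the names off the unevaluated call,
    # so plyr does not have to be loaded.
    group_names <- as.character(expr)[-1]
  } else {
    # New notation: evaluate the argument normally (a literal
    # c("cyl", "vs") or a variable holding a character vector).
    group_names <- as.character(grouping)
  }
  df %>%
    group_by(!!!syms(group_names)) %>%
    summarise(m = mean(mpg))
}

grp <- c("cyl", "vs")
res_strings <- my_function_compat(mtcars, grp)         # character vector in a variable
res_dots    <- my_function_compat(mtcars, .(cyl, vs))  # old dot notation
```

Both calls should give the same grouped means. The dot branch is also a natural place to emit a deprecation warning for the old notation.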

Error when using dplyr inside of a function

I'm trying to put together a function that creates a subset from my original data frame, and then uses dplyr's SELECT and MUTATE to give me the number of large/small entries, based on the sum of the width and length of sepals/petals.
filter <- function (spp, LENGTH, WIDTH) {
  d <- subset (iris, subset=iris$Species == spp) # This part seems to work just fine
  large <- d %>%
    select (LENGTH, WIDTH) %>% # This is where the problem arises.
    mutate (sum = LENGTH + WIDTH)
  big_samples <- which(large$sum > 4)
  return (length(big_samples))
}
Basically, I want the function to return the number of large flowers. However, when I run the function I get the following error -
filter("virginica", "Sepal.Length", "Sepal.Width")
Error: All select() inputs must resolve to integer column positions.
The following do not:
* LENGTH
* WIDTH
What am I doing wrong?
You are running into NSE/SE problems; see the vignette for more info.
Briefly, dplyr uses non-standard evaluation (NSE) of names, so passing column names into a function as strings breaks unless you use the standard evaluation (SE) versions.
The SE versions of the dplyr functions end in _. You can see that select_ works nicely with your original arguments.
However, things get more complicated when the arguments are used inside expressions. We can use lazyeval::interp to convert most function arguments into column names; see the conversion of the mutate call to mutate_ in your function below and, more generally, the help: ?lazyeval::interp
Try:
filter <- function (spp, LENGTH, WIDTH) {
  d <- subset (iris, subset=iris$Species == spp)
  large <- d %>%
    select_(LENGTH, WIDTH) %>%
    mutate_(sum = lazyeval::interp(~X + Y, X = as.name(LENGTH), Y = as.name(WIDTH)))
  big_samples <- which(large$sum > 4)
  return (length(big_samples))
}
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
filter_big <- function(spp, LENGTH, WIDTH) {
  LENGTH <- enquo(LENGTH) # Create quosure
  WIDTH <- enquo(WIDTH) # Create quosure
  iris %>%
    filter(Species == spp) %>%
    select(!!LENGTH, !!WIDTH) %>% # Use !! to unquote the quosure
    mutate(sum = (!!LENGTH) + (!!WIDTH)) %>% # Use !! to unquote the quosure
    filter(sum > 4) %>%
    nrow()
}
filter_big("virginica", Sepal.Length, Sepal.Width)
#> [1] 50
If quosure and quasiquotation are too much for you, use either .data[[ ]] or rlang {{ }} (curly curly) instead. See Hadley Wickham's 5min video on tidy evaluation and (maybe) Tidy evaluation section in Hadley's Advanced R book for more information.
library(rlang)
library(dplyr)
filter_data <- function(df, spp, LENGTH, WIDTH) {
  res <- df %>%
    filter(Species == spp) %>%
    select(.data[[LENGTH]], .data[[WIDTH]]) %>%
    mutate(sum = .data[[LENGTH]] + .data[[WIDTH]]) %>%
    filter(sum > 4) %>%
    nrow()
  return(res)
}
filter_data(iris, "virginica", "Sepal.Length", "Sepal.Width")
#> [1] 50
filter_rlang <- function(df, spp, LENGTH, WIDTH) {
  res <- df %>%
    filter(Species == spp) %>%
    select({{LENGTH}}, {{WIDTH}}) %>%
    mutate(sum = {{LENGTH}} + {{WIDTH}}) %>%
    filter(sum > 4) %>%
    nrow()
  return(res)
}
filter_rlang(iris, "virginica", Sepal.Length, Sepal.Width)
#> [1] 50
Created on 2019-11-10 by the reprex package (v0.3.0)
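When the column names arrive as strings, newer tidyselect releases discourage .data[[ ]] inside select() and recommend all_of() there, while .data[[ ]] remains fine inside mutate(). A minimal sketch of the string-based variant under that assumption (tidyselect 1.0.0 or later for all_of(); the function name count_big is hypothetical):

```r
library(dplyr)

# Hypothetical name; counts rows of one species whose two measurements sum to > 4.
count_big <- function(df, spp, LENGTH, WIDTH) {
  df %>%
    filter(Species == spp) %>%
    select(all_of(c(LENGTH, WIDTH))) %>%            # select columns by string name
    mutate(sum = .data[[LENGTH]] + .data[[WIDTH]]) %>%  # data-masked lookup by string
    filter(sum > 4) %>%
    nrow()
}

count_big(iris, "virginica", "Sepal.Length", "Sepal.Width")
#> [1] 50
```

The design choice is the usual split in dplyr: selecting verbs take tidyselect helpers such as all_of(), whereas computing verbs use the .data pronoun.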

Pass column name to function from mutate_each

I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evaluation to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))
I'd suggest taking a different approach. Instead of using mutate_each a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiply each column by a different integer
mtcars %>%
  dplyr::tbl_df() %>%
  dplyr::mutate(make_and_model = rownames(mtcars)) %>%
  tidyr::gather(key, value, -make_and_model) %>%
  dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
                value = value * m) %>%
  dplyr::select(-m) %>%
  tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the multiplicative values used.
mtcars %>%
  tidyr::gather() %>%
  dplyr::mutate(m = as.integer(factor(key))) %>%
  dplyr::select(-value) %>%
  dplyr::distinct()
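The question asked for something like a .name variable holding the current column name. Later dplyr versions offer this as cur_column(), which can be used inside across(). A minimal sketch, assuming dplyr 1.0.0 or later and reusing the question's param.A lookup (built here with setNames() for brevity):

```r
library(dplyr)

# One multiplier per column, keyed by column name (1 for mpg, 2 for cyl, ...)
param.A <- setNames(seq_along(names(mtcars)), names(mtcars))

# cur_column() returns the name of the column currently being processed,
# so each column is multiplied by its own entry in param.A.
scaled <- mtcars %>%
  mutate(across(everything(), ~ .x * param.A[[cur_column()]]))

head(scaled[, c("mpg", "cyl", "disp")])
```

This keeps the data in wide form, so no gather/spread round trip is needed.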

