Passing argument into function for group_by in dplyr [duplicate] - r

This question already has answers here:
How to pass column name as argument to function for dplyr verbs?
(4 answers)
Closed 7 months ago.
I am trying to use group_by within a function call in dplyr (R) and I am getting unexpected results. Here is an example of what I am trying to do:
df = data.frame(a = c(0,0,1,1), b = c(0,1,0,1), c = c(1,2,3,4))
result1 = df %>%
group_by(a,b) %>%
mutate(d = sum(c))
result1$d
myFunc <- function(df, var) {
output = df %>%
group_by(a,!!var) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,"b")
result2$d
result1$d yields [1,2,3,4] which is what I expected. result2$d yields [3,3,7,7] which I do not want, and I am not sure what is going on.
It works to have b (without quotes) as the function argument, and {{var}} in place of !!var. Unfortunately, in my case, my column names are in string format (but maybe there is a way to transform the string beforehand so that it will work with the {{}} notation?)

If you want to pass a character object that can refer to a certain column of a data frame, you should use !!sym(var):
myFunc <- function(df, var) {
output = df %>%
group_by(a, !!sym(var)) %>%
mutate(d = sum(c))
return(output)
}
myFunc(df, "b")
If you want to pass a data-masked argument, you should use {{ var }} or equivalently !!enquo(var):
myFunc <- function(df, var) {
output = df %>%
group_by(a, {{ var }}) %>%
mutate(d = sum(c))
return(output)
}
myFunc(df, b)
Note that I pass "b" and b respectively into the function in the two different cases.
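To see why the original !!var call returned 3 3 7 7, it may help to compare the two groupings side by side. This is only a sketch, assuming dplyr and rlang are installed; the names by_constant and by_column are illustrative and not part of the question. Unquoting a string inserts the literal value "b", so group_by() creates a constant column and the rows end up grouped by a alone.

```r
library(dplyr)
library(rlang)

df <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), c = c(1, 2, 3, 4))
var <- "b"

# !!var splices in the string "b": a constant, so only a defines the groups
by_constant <- df %>%
  group_by(a, !!var) %>%
  mutate(d = sum(c)) %>%
  ungroup()

# !!sym(var) splices in the symbol b: columns a and b define the groups
by_column <- df %>%
  group_by(a, !!sym(var)) %>%
  mutate(d = sum(c)) %>%
  ungroup()

by_constant$d  # 3 3 7 7
by_column$d    # 1 2 3 4
```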

If we want to use quoting and unquoting instead of curly-curly {{}}, then we should follow this basic procedure: https://tidyeval.tidyverse.org/dplyr.html
Creating a function around dplyr pipelines involves three steps: abstraction, quoting, and unquoting.
1. Abstraction step:
Here we identify the varying steps. In our case var in group_by:
2. Quoting step:
Identify all the arguments where the user is allowed to refer to data frame columns directly.
The function can’t evaluate these arguments right away.
Instead they should be automatically quoted. Apply enquo() to these arguments
3. Unquoting step:
Identify where these variables are passed to other quoting functions and unquote with !!.
In this case we pass var to group_by():
myFunc <- function(df, var) {
var <- enquo(var)
output = df %>%
group_by(a,!!var) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,b)
result2$d
Output:
[1] 1 2 3 4

Just as I post a question, I come across something that works...
myFunc <- function(df, var) {
output = df %>%
group_by_at(.vars = c("a",var)) %>%
mutate(d = sum(c))
return(output)
}
result2 = myFunc(df,"b")
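group_by_at() is superseded in recent dplyr releases. A possible equivalent, assuming dplyr 1.0.0 or later, combines across() with all_of(), which accepts column names stored as strings. The function and data below are the ones from the question; the trailing ungroup() is an addition for tidiness.

```r
library(dplyr)

df <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), c = c(1, 2, 3, 4))

myFunc <- function(df, var) {
  df %>%
    # all_of() selects columns by the character vector c("a", var)
    group_by(across(all_of(c("a", var)))) %>%
    mutate(d = sum(c)) %>%
    ungroup()
}

result2 <- myFunc(df, "b")
result2$d  # 1 2 3 4
```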

Related

How do I add checks to a function created using tidy eval framework?

Say I have created a function using the tidy eval framework -
library(tidyverse)
library(rlang)
my_function <- function(data, var){
var_expr <- enquo(var)
data %>%
group_by(!!var_expr) %>%
summarise(count = n()) %>%
ungroup()
}
When I run the following function, I get the result below it
my_function(mtcars, cyl)
# A tibble: 3 x 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14
How do I add the following checks to this function -
Check if data is a dataframe. If not, return the error data should be a dataframe
Check if var is missing. If so return the error var is missing
You can make the following modifications.
To check whether the input data is of a particular class, we can inspect its class attribute. Both data frames and tibbles contain data.frame in their class attribute, so the same check covers both.
The missing function is normally used inside a function to check whether an argument was supplied, so that a default value can be generated when it was not. In your case we can use it to terminate the execution of the function instead (you can also check the source code of the sample function to see how it assigns a value to the size argument when it is missing).
You can use base::stop in place of rlang::abort, as pointed out by @akrun.
library(rlang)
my_function <- function(data, var){
if(!"data.frame" %in% attr(data, "class")) {
abort("data should be a data frame")
}
if(missing(var)) {
abort("var is missing")
}
var_expr <- enquo(var)
data %>%
group_by(!!var_expr) %>%
summarise(count = n()) %>%
ungroup()
}
Special thanks to @27 ϕ 9 for bringing this valuable point to my attention. We can also customize the error message in the stopifnot function, which is another way of checking your input arguments:
my_function <- function(data, var){
stopifnot("The input data is not of class data frame" = "data.frame" %in% attr(data, "class") ,
"var is missing" = !missing(var))
var_expr <- enquo(var)
data %>%
group_by(!!var_expr) %>%
summarise(count = n()) %>%
ungroup()
}
Special thanks to @IceCreamToucan for presenting yet another option: using the inherits function in lieu of attr. If the input data does not include data.frame in its class attribute, inherits returns FALSE:
my_function <- function(data, var){
if(!inherits(data, "data.frame")) {
stop("data is not of class data.frame")
}
if(missing(var)) {
stop("var is missing")
}
var_expr <- enquo(var)
data %>%
group_by(!!var_expr) %>%
summarise(count = n()) %>%
ungroup()
}
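To try the checks out without stopping a script, the errors can be captured with tryCatch. This is a sketch only: err_msg is a hypothetical helper introduced here, and the function body repeats the inherits version above.

```r
library(dplyr)

my_function <- function(data, var) {
  if (!inherits(data, "data.frame")) stop("data is not of class data.frame")
  if (missing(var)) stop("var is missing")
  var_expr <- enquo(var)
  data %>%
    group_by(!!var_expr) %>%
    summarise(count = n()) %>%
    ungroup()
}

# hypothetical helper: return the error message, or NA if no error was raised
err_msg <- function(expr) {
  tryCatch({ expr; NA_character_ }, error = function(e) conditionMessage(e))
}

msg_data <- err_msg(my_function(1:3, cyl))  # input is not a data frame
msg_var  <- err_msg(my_function(mtcars))    # var not supplied
ok       <- my_function(mtcars, cyl)        # valid call, no error
```

Because err_msg only evaluates its argument inside tryCatch, the failing calls never abort the surrounding code.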

Passing variable in function to other function variables in R

I am trying to pass a variable Phyla (which is also the name of a df column of interest) into other functions. However, I get the error: Error: Column `tax_level` is unknown. I understand why this happens, but it would be more convenient to state the column I want to use once in the function, since it will be repeated numerous times in the script. I have tried using OTU_melt_grouped[,1], since this will always be the first column to use in the dcast function, but I get the error: Error: Must use a vector in `[`, not an object of class matrix. Moreover, it does not solve my problem in the group_by function, since I want to be able to specify Phyla, Class, Order etc...
I am sure there must be a simple solution, but I don't know where to start!
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
OTU_melt_grouped <- data %>%
group_by(tax_level, variable) %>%
summarise(value = sum(value))
taxa_cols <- dcast(OTU_melt_grouped, variable ~ tax_level)
rownames(taxa_cols) <- meta_data$site
taxa_cols <- taxa_cols[-1]
return(taxa_cols)
}
tax_test <- taxa_specific_columns_func(OTU_melt)
As we are passing an unquoted variable, we could make use of curly-curly ({{..}}) operator in group_by
library(dplyr)
library(tidyr)
library(tibble)
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
data %>%
group_by({{tax_level}}, variable) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = {{tax_level}}, values_from = value) %>%
column_to_rownames("variable")
}
taxa_specific_columns_func(OTU_melt)
# A B C D E
#a 0.01859254 0.42141238 -0.196961 -0.1859115 -0.2901680
#b -0.64700080 NA -0.161108 NA NA
#c -0.03297331 0.05871052 -1.963341 NA 0.7608218
data
set.seed(48)
OTU_melt <- data.frame(Phyla = rep(LETTERS[1:5], each = 3),
variable = sample(letters[1:3], 15, replace = TRUE), value = rnorm(15))

Calculate mode for each column in dataframe using lapply dplyr

I'm trying to create a function that essentially gets me the MODE, or MODE-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I may be missing and I'm looking for some assistance? I believe it has to do with the passing in of a variable into dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually a whole column of mtcars. To get the names of the columns, apply the function to names(mtcars). But then you also have to specify the data frame you're working on.
To evaluate the symbol you get from sym, you need to use !! in front of rlang::sym(x).
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the second-to-last line of your function.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
If we need this in a list, we can use imap
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)

How can I simultaneously assign value to multiple new columns with R and dplyr?

Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is a do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place for the different columns.
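With dplyr 1.0.0 or later there may be a simpler route: when an unnamed expression inside mutate() evaluates to a data frame, its columns are added to the result. The sketch below relies on that behaviour; f_df is a hypothetical wrapper introduced here to name the values returned by f().

```r
library(dplyr)

base <- data.frame(a = 1)
f <- function() c(2, 3, 4)

# hypothetical wrapper: turn the vector from f() into a one-row data frame
f_df <- function() {
  as.data.frame(as.list(setNames(f(), c("b", "c", "d"))))
}

result <- base %>%
  rowwise() %>%
  mutate(f_df()) %>%  # unnamed data frame result becomes columns b, c, d
  ungroup()

result
```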
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
Rename your columns.

R: row-wise dplyr::mutate using function that takes a data frame row and returns an integer

I am trying to use a piped mutate statement with a custom function. I looked at this somewhat similar SO post, but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select some columns based on the value in the exclude column and, e.g., calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two attempts do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_tibble(.)  # as_data_frame() is deprecated in favour of as_tibble()
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)  # as_data_frame() is deprecated in favour of as_tibble()
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).
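If a recent dplyr is available, mutate can be handed just the current row rather than the whole tibble. The following is a sketch under the assumption of dplyr 1.1.0 or later, where pick() returns the selected columns for the current group, and a rowwise group is a single row. The data and FUN are copied from the question.

```r
library(dplyr)

df <- data.frame(exclude = c('B', 'B', 'D'),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c('fd', 'fs', 'sa'),
                 stringsAsFactors = FALSE)

FUN <- function(df) {
  sum(df[c('B', 'C', 'D')][!names(df[c('B', 'C', 'D')]) %in% df['exclude']])
}

result <- df %>%
  rowwise() %>%
  # pick(everything()) is a one-row tibble here, so FUN sees a single row
  mutate(s = FUN(pick(everything()))) %>%
  ungroup()

result$s
```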
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise makes each subsequent function operate one row at a time, which supports arbitrarily complex operations that need to be applied to each row.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
  exclude     B     C     D blob      s
  <chr>   <dbl> <dbl> <dbl> <chr> <dbl>
1 B           1     3     1 fd        4
2 B           0     4     1 fs        5
3 D           0     9     0 sa        9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
Forewarning: while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but comes at a rather large speed loss. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see @konvas's answer), and roughly on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but is perhaps also easier to work with. Different user-defined functions may change the speed order.
