Summarizing by dynamic column name in dplyr - r

So I'm trying to do some programming in dplyr and I am having some trouble with the enquo and !! evaluations.
Basically, I would like to mutate a column under a dynamic column name and then be able to further manipulate that column (e.g. summarize it). For instance:
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1)
}
my_function(iris, Petal.Length)
This works great and returns a column called "Petal.Length_adjusted", which is just Petal.Length increased by one.
However I can't seem to summarize this new column.
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarize(!!mean_col := mean(!!new_col))
}
my_function(iris, Petal.Length)
This results in a warning stating the argument "Petal.Length_adjusted" is not numeric or logical, although the output from the mutate call gives a numeric column.
How do I reference this dynamically generated column name to pass it in further dplyr functions?

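Unlike quo_column, which is a quosure, new_col and mean_col are plain strings. Unquoting a string with !! inserts the string itself, so mean(!!new_col) becomes mean("Petal.Length_adjusted"), which is why you get the warning. Convert the string to a symbol with sym (from rlang) and unquote that instead: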
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarise(!!mean_col := mean(!! rlang::sym(new_col)))
}
head(my_function(iris, Petal.Length))
# A tibble: 3 x 2
# Species Petal.Length_meanAdjusted
# <fct> <dbl>
#1 setosa 2.46
#2 versicolor 5.26
#3 virginica 6.55
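For what it's worth, with newer package versions the same idea can be sketched without sym by looking the string up through the .data pronoun. This assumes dplyr >= 1.0.0 and rlang >= 0.4.0, which the answer above does not require, and the name my_function2 is made up for this illustration:

```r
library(dplyr)

# Hypothetical variant of my_function; assumes dplyr >= 1.0.0 and rlang >= 0.4.0
my_function2 <- function(data, column) {
  # as_name() turns the captured column into a plain string, e.g. "Petal.Length"
  col_name <- rlang::as_name(rlang::enquo(column))
  new_col  <- paste0(col_name, "_adjusted")
  mean_col <- paste0(col_name, "_meanAdjusted")

  data %>%
    mutate(!!new_col := {{ column }} + 1) %>%
    group_by(Species) %>%
    # .data[[new_col]] looks up the column whose name is stored in new_col
    summarise(!!mean_col := mean(.data[[new_col]]))
}

my_function2(iris, Petal.Length)
```

The .data[[new_col]] form keeps the name as a string all the way through, so there is no separate string-to-symbol conversion step.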

Related

Take function from dataframe

I need to perform calculations based on inputs defined in a data frame. See the data frame RefDf below. It has 3 columns: the column name, the calculation, and the new variable name. When the Calculation column contains count, the n_distinct() function should be used.
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length count Petal.LengthNew
", header = T)
The manual approach below needs to be automated from the inputs in RefDf. Species always remains the grouping variable.
library(dplyr)
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
I am looking for a dplyr or base R solution.
Here's a solution with the data.table package:
library(data.table)
library(dplyr)
# using data.table
dt <- as.data.table(RefDf)
dt[Calculation == "count", Calculation := "n_distinct"]
# function for doing grouping calculation
inner.fun <- function(calc, data, column, group="Species"){
print(column)
data.dt <- as.data.table(data)
data.dt[, .(as.numeric(get(calc)(get(column)))), by=group][]
}
out <- dt[, inner.fun(calc=Calculation, data=iris, column=Variables), by=NewVariable]
# reshape from long to wide
out2 <- dcast(data=out, Species ~ NewVariable, value.var="V1")
# convert to data.frame
out_df <- as.data.frame(out2)
out_df
Species Petal.LengthNew Sepal.Length2
1 setosa 9 250.3
2 versicolor 19 296.8
3 virginica 20 329.4
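Since the question asked for dplyr or base R, here is a rough dplyr sketch of the same thing, assuming dplyr >= 1.0.0 and rlang are available. It builds one call per row of RefDf and splices them into summarise with !!!. The helper names fun_names, exprs_list and out_dplyr are made up for this example, and RefDf is rebuilt with data.frame() only so the block runs on its own:

```r
library(dplyr)
library(rlang)

# Same content as RefDf in the question, rebuilt so the example is self-contained
RefDf <- data.frame(
  Variables   = c("Sepal.Length", "Petal.Length"),
  Calculation = c("sum", "count"),
  NewVariable = c("Sepal.Length2", "Petal.LengthNew"),
  stringsAsFactors = FALSE
)

# "count" in RefDf means n_distinct(); everything else is used as given
fun_names <- ifelse(RefDf$Calculation == "count", "n_distinct", RefDf$Calculation)

# Build calls such as sum(Sepal.Length, na.rm = TRUE), one per row of RefDf
exprs_list <- Map(function(f, v) call2(f, sym(v), na.rm = TRUE),
                  fun_names, RefDf$Variables)
names(exprs_list) <- RefDf$NewVariable

# Splice the named calls into summarise()
out_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(!!!exprs_list)
out_dplyr
```

The names of exprs_list become the output column names, so adding a row to RefDf adds a column to the result.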

dplyr: Is it possible to return two columns in summarize using one function?

Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them in different columns.
fn = function(x) {
list(mean = mean(x),sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd
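As a side note, if you can assume dplyr >= 1.0.0 (newer than what the answer above was written for), summarise accepts a function that returns a data frame and unpacks its columns, so the list column and the tidyr step may not be needed. A minimal sketch, where fn_df and res_two are hypothetical names used only here:

```r
library(dplyr)

# Hypothetical variant of fn that returns a one-row tibble instead of a list
fn_df <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# An unnamed data-frame result is unpacked into separate columns (dplyr >= 1.0.0)
res_two <- iris %>%
  summarise(fn_df(Petal.Length))
res_two
```

The same call also works after group_by(Species), giving one mean/sd pair per group.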

Using dplyr group_by in a function

I am trying to use dplyr's group_by in a local function, example:
testFunction <- function(df, x) {
df %>%
group_by(x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
and I get an error "... unknown variable to group by: x"
I've tried group_by_ and it gives me a summary of the entire dataset.
Anybody have a clue how I can fix this?
Thanks in advance!
Here is one way with enquo from dplyr: enquo captures the expression passed as the argument (here the bare name Species, not a string) and wraps it in a quosure, which is then evaluated by unquoting (UQ or !!) inside group_by, mutate, summarise, etc.
library(dplyr)
testFunction <- function(df, x) {
x <- enquo(x)
df %>%
group_by(!! x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
# A tibble: 3 x 2
# Species mean.Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026
I got it to work like this:
testFunction <- function(df, x) {
df %>%
group_by(get(x)) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris,"Species")
I changed x to get(x), and Species to "Species" in testFunction(iris,...).
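For newer versions, both calling styles above can be sketched more directly. This assumes dplyr >= 1.0.0 and rlang >= 0.4.0; the function names testFunctionBare and testFunctionStr are made up for the illustration:

```r
library(dplyr)

# Bare column name: {{ x }} captures and unquotes in one step
testFunctionBare <- function(df, x) {
  df %>%
    group_by({{ x }}) %>%
    summarize(mean.Petal.Width = mean(Petal.Width))
}

# Column name as a string: all_of() selects the column named by x
testFunctionStr <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarize(mean.Petal.Width = mean(Petal.Width))
}

testFunctionBare(iris, Species)
testFunctionStr(iris, "Species")
```

Both versions keep the grouping column named Species, whereas group_by(get(x)) names it after the expression.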

R dplyr methods inside own function

Consider this dplyr treatment to a data frame:
existing.df <- filter(existing.df, justanEx > 0) %>%
arrange(desc(justanEx)) %>%
mutate(mean = mean(justanEx),
median = median(justanEx),
rank = seq_len(length(anotherVar)))
I have to do this a lot on a job I'm doing, so I tried making a function for it:
df.overZ <- function(data, var){
df <- data %>% filter(var > 0) %>%
arrange_(desc((var))) %>%
mutate(mean = mean(var),
median = median(var),
rank = seq_len(length(anotherVar)))
df
}
and then
existing.df <- df.overZ(existing.df, "realVar")
but this gives me this error:
Error in arrange_impl(.data, dots) :
incorrect size (1), expecting : 50000
If I try:
existing.df <- df.overZ(existing.df, realVar)
I get this error:
Error in filter_impl(.data, dots) : obj 'realVar' not found
I have already tried filter_, arrange_ and mutate_, but nothing seems to work.
Can this work?
The following function works, though:
make.df <- function(var, n){
df <- orign.df %>% filter(!is.na(var)) %>%
select(1:2,n,3:6)
df
}
existing.df <- make.df("oneVar",7)
With the development version of dplyr (soon to be released as 0.6.0), we can make use of quosures:
library(dplyr)
df.overZ <- function(data, Var){
Var <- enquo(Var)
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
mutate(Mean = mean(UQ(Var)),
Median = median(UQ(Var)),
rank = row_number())
}
df.overZ(iris, Sepal.Length)
We can extend this function to have a group_by option as well
df.overZ2 <- function(data, Var, grpVar){
Var <- enquo(Var)
grpVar <- enquo(grpVar)
newVar <- paste(quo_name(Var), c("Mean", "Median", "Rank"), sep="_")
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
group_by(UQ(grpVar)) %>%
summarise(UQ(newVar[1]) := mean(UQ(Var)),
UQ(newVar[2]) := median(UQ(Var)),
UQ(newVar[3]) := n())
}
df.overZ2(iris, Sepal.Length, Species)
# A tibble: 3 × 4
# Species Sepal.Length_Mean Sepal.Length_Median Sepal.Length_Rank
# <fctr> <dbl> <dbl> <int>
#1 setosa 5.006 5.0 50
#2 versicolor 5.936 5.9 50
#3 virginica 6.588 6.5 50
Here, enquo does a job similar to substitute from base R: it captures the input argument and converts it to a quosure. Within the functions (filter/arrange/mutate/summarise/group_by) we then unquote (!! or UQ) to evaluate it. We can also set the output column names by unquoting a string on the lhs of the assignment (:=), as done with newVar above.
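If a recent dplyr (>= 1.0.0, with the rlang it depends on) can be assumed, the output names can also be built with glue-style strings on the lhs of :=, which avoids pasting the names together by hand. A sketch of that idea, with df.overZ3 as a made-up name:

```r
library(dplyr)

# Hypothetical variant of df.overZ2; assumes dplyr >= 1.0.0
df.overZ3 <- function(data, Var, grpVar) {
  data %>%
    filter({{ Var }} > 0) %>%
    group_by({{ grpVar }}) %>%
    # "{{ Var }}_Mean" expands to e.g. "Sepal.Length_Mean"
    summarise("{{ Var }}_Mean"   := mean({{ Var }}),
              "{{ Var }}_Median" := median({{ Var }}),
              "{{ Var }}_Rank"   := n())
}

df.overZ3(iris, Sepal.Length, Species)
```

The arrange step is left out here because summarise does not depend on row order.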

Renaming a column name, by using the data frame title/name

I have a data frame called "Something". I am doing an aggregation on one of the numeric columns using summarise, and I want the name of that column to contain "Something" - data frame title in the column name.
Example:
temp <- Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))
But I would like to name the aggregate column "avg_Something_score". Does that make sense?
We can use the development version of dplyr (soon to be released as 0.6.0), which does this with quosures:
library(dplyr)
myFun <- function(data, group, value){
dataN <- quo_name(enquo(data))
group <- enquo(group)
value <- enquo(value)
newName <- paste0("avg_", dataN, "_", quo_name(value))
data %>%
group_by(!!group) %>%
summarise(!!newName := mean(!!value))
}
myFun(mtcars, cyl, mpg)
# A tibble: 3 × 2
# cyl avg_mtcars_mpg
# <dbl> <dbl>
#1 4 26.66364
#2 6 19.74286
#3 8 15.10000
myFun(iris, Species, Petal.Width)
# A tibble: 3 × 2
# Species avg_iris_Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026
Here, enquo takes the input arguments like substitute from base R and converts them to quosures. With quo_name, we can convert a quosure to a string, and we evaluate the quosure by unquoting (!! or UQ) inside group_by/summarise/mutate etc. The column names on the lhs of the assignment (:=) can also be evaluated by unquoting to get the columns of interest.
You can use rename_ from dplyr with deparse(substitute(Something)) like this:
Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))%>%
rename_(.dots = setNames("avg_score",
paste0("avg_",deparse(substitute(Something)),"_score") ))
It seems like it makes more sense to generate the new column name dynamically so that you don't have to hard-code the name of the data frame inside setNames. Maybe something like the function below, which takes a data frame, a grouping variable, and a numeric variable:
library(dplyr)
library(lazyeval)
my_fnc = function(data, group, value) {
df.name = deparse(substitute(data))
data %>%
group_by_(group) %>%
summarise_(avg = interp(~mean(v), v=as.name(value))) %>%
rename_(.dots = setNames("avg", paste0("avg_", df.name, "_", value)))
}
Now let's run the function on two different data frames:
my_fnc(mtcars, "cyl", "mpg")
cyl avg_mtcars_mpg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
my_fnc(iris, "Species", "Petal.Width")
Species avg_iris_Petal.Width
1 setosa 0.246
2 versicolor 1.326
3 virginica 2.026
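The underscore verbs and lazyeval used above are deprecated in current dplyr. Assuming dplyr >= 1.0.0, the same string-based interface can be sketched with across and its .names argument; my_fnc2 is a made-up name for this example:

```r
library(dplyr)

# Hypothetical rewrite of my_fnc without lazyeval; assumes dplyr >= 1.0.0
my_fnc2 <- function(data, group, value) {
  # Capture the name of the data frame as typed by the caller
  df.name <- deparse(substitute(data))
  data %>%
    group_by(across(all_of(group))) %>%
    # {.col} is replaced by the name of the summarised column
    summarise(across(all_of(value), mean,
                     .names = paste0("avg_", df.name, "_{.col}")))
}

my_fnc2(mtcars, "cyl", "mpg")
my_fnc2(iris, "Species", "Petal.Width")
```

Because value goes through all_of, passing a character vector of several columns should give one avg_ column per input column.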
library(dplyr)
# Take mtcars as an example
# Calculate the mean of mpg using cyl as group
data(mtcars)
Something <- mtcars
# Create a list of expression
dots <- list(~mean(mpg))
# Apply the function, Use setNames to name the column
temp <- Something %>%
group_by(cyl) %>%
summarise_(.dots = setNames(dots,
paste0("avg_", as.character(quote(Something)), "_score")))
You could use colnames(Something)<-c("score","something_avg_score")
