Renaming a column name, by using the data frame title/name - r

I have a data frame called "Something". I am doing an aggregation on one of the numeric columns using summarise, and I want the name of that column to contain "Something" - data frame title in the column name.
Example:
temp <- Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))
But i would like to name the aggregate column as "avg_Something_score". Did that make sense?

We can use the devel version of dplyr (soon to be released 0.6.0) that does this with quosures
library(dplyr)
myFun <- function(data, group, value){
dataN <- quo_name(enquo(data))
group <- enquo(group)
value <- enquo(value)
newName <- paste0("avg_", dataN, "_", quo_name(value))
data %>%
group_by(!!group) %>%
summarise(!!newName := mean(!!value))
}
myFun(mtcars, cyl, mpg)
# A tibble: 3 × 2
# cyl avg_mtcars_mpg
# <dbl> <dbl>
#1 4 26.66364
#2 6 19.74286
#3 8 15.10000
myFun(iris, Species, Petal.Width)
# A tibble: 3 × 2
# Species avg_iris_Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026
Here, the enquo takes the input arguments like substitute from base R and converts to quosure, with quo_name, we can convert it to string, evaluate the quosure by unquoting (!! or UQ) inside group_by/summarise/mutate etc. The column names on the lhs of assignment (:=) can also evaluated by unquoting to get the columns of interest

You can use rename_ from dplyr with deparse(substitute(Something)) like this:
Something %>%
group_by(Month) %>%
summarise(avg_score=mean(score))%>%
rename_(.dots = setNames("avg_score",
paste0("avg_",deparse(substitute(Something)),"_score") ))

It seems like it makes more sense to generate the new column name dynamically so that you don't have to hard-code the name of the data frame inside setNames. Maybe something like the function below, which takes a data frame, a grouping variable, and a numeric variable:
library(dplyr)
library(lazyeval)
my_fnc = function(data, group, value) {
df.name = deparse(substitute(data))
data %>%
group_by_(group) %>%
summarise_(avg = interp(~mean(v), v=as.name(value))) %>%
rename_(.dots = setNames("avg", paste0("avg_", df.name, "_", value)))
}
Now let's run the function on two different data frames:
my_fnc(mtcars, "cyl", "mpg")
cyl avg_mtcars_mpg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
my_fnc(iris, "Species", "Petal.Width")
Species avg_iris_Petal.Width
1 setosa 0.246
2 versicolor 1.326
3 virginica 2.026

library(dplyr)
# Take mtcars as an example
# Calculate the mean of mpg using cyl as group
data(mtcars)
Something <- mtcars
# Create a list of expression
dots <- list(~mean(mpg))
# Apply the function, Use setNames to name the column
temp <- Something %>%
group_by(cyl) %>%
summarise_(.dots = setNames(dots,
paste0("avg_", as.character(quote(Something)), "_score")))

You could use colnames(Something)<-c("score","something_avg_score")

Related

dplyr: Is it possible to return two columns in summarize using one function?

Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them in different columns.
fn = function(x) {
list(mean = mean(x),sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd

Pass a list of variable names to a function using {{foo}}

Problem
I would like to know how to pass a list of variable names to a purrr::map2 function for the purpose of iterating over a separate data frame.
The input_table$key variable below contains mpg and disp from the mtcars dataset. I think the names of the variables are being passed as character strings rather than variable names. The question is how I can change that so that my function recognises that they are variable names(?).
In this example I am trying to sum all of the values in the mtcars variables mpg and disp that fall below a set of numeric thresholds. Those variables from mtcars and the relevant thresholds are contained in input_table (below).
Ideal result
percentile key value sum_y
<fct> <chr> <dbl> <dbl>
1 0.5 mpg 19.2 266.5
2 0.9 mpg 30.1 515.8
3 0.99 mpg 33.4 609.0
4 1 mpg 33.9 642.9
5 ... ... ... ...
Attempt
library(dplyr)
library(purrr)
library(tidyr)
# Arrange a generic example
# Replicating my data structure
input_table <- mtcars %>%
as_tibble() %>%
select(mpg, disp) %>%
map_df(quantile, probs = c(0.5, 0.90, 0.99, 1)) %>%
mutate(
percentile = factor(c(0.5, 0.90, 0.99, 1))
) %>%
select(
percentile, mpg, disp
) %>%
gather(key, value, -percentile)
# Defining the function
test_func <- function(label_desc, threshold) {
mtcars %>%
select({{label_desc}}) %>%
filter({{label_desc}} <= {{threshold}}) %>%
summarise(
sum_y = sum(as.numeric({{label_desc}}), na.rm = T)
)
}
# Demo'ing that it works for a single variable and threshold value
test_func(label_desc = mpg, threshold = 19.2)
# This is where I am having trouble
# Trying to iterate over multiple (mpg, disp) variables
map2(input_table$key, input_table$value, ~test_func(label_desc = .x, threshold = .y))
The issue is curly-curly ({{}}) is used for unquoted variables as you are using in your first attempt. In your second attempt you are passing quoted variables to which the curly-curly operator does not work. A simple fix would be to use _at variants of dplyr which accepts quoted arguments.
test_func <- function(label_desc, threshold) {
mtcars %>%
filter_at(label_desc, any_vars(. <= threshold)) %>%
summarise_at(label_desc, sum)
}
purrr::map2(input_table$key, input_table$value, test_func)
#[[1]]
# mpg
#1 266.5
#[[2]]
# mpg
#1 515.8
#[[3]]
# mpg
#1 609
#[[4]]
# mpg
#1 642.9
#[[5]]
# disp
#1 1956.7
#.....

Summarizing by dynamic column name in dplyr

So I'm trying to do some programming in dplyr and I am having some trouble with the enquo and !! evaluations.
Basically I would like to mutate a column to a dynamic column name, and then be able to further manipulate that column (i.e. summarize). For instance:
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1)
}
my_function(iris, Petal.Length)
This works great and returns a column called "Petal.Length.adjusted" which is just Petal.Length increased by one.
However I can't seem to summarize this new column.
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarize(!!mean_col := mean(!!new_col))
}
my_function(iris, Petal.Length)
This results in a warning stating the argument "Petal.Length_adjusted" is not numeric or logical, although the output from the mutate call gives a numeric column.
How do I reference this dynamically generated column name to pass it in further dplyr functions?
Unlike the quo_column which is a quosure, the new_col and mean_col are strings, so we convert it to symbol using sym (from rlang) and then do the evaluation
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarise(!!mean_col := mean(!! rlang::sym(new_col)))
}
head(my_function(iris, Petal.Length))
# A tibble: 3 x 2
# Species Petal.Length_meanAdjusted
# <fct> <dbl>
#1 setosa 2.46
#2 versicolor 5.26
#3 virginica 6.55

dplyr+purrr: n() refers to map() groups, not local groups?

I have a parent dataset nesting multiple datasets (i.e. a tibble where each cell is a tibble) , where I want for each dataset, to find the number of rows of each group. Standard way, using a single dataset, would simply be to do group_by(var) %>% mutate(nrow=n()).
But now that I do this for multiple datasets with a map() call, it looks like the n() call refers to the (implicit) grouping made by map(), not the explicit grouping within my local dataset made by group_by?
Standard way for one single dataset, n() returns 50:
iris %>%
group_by(., Species) %>%
mutate(nrow=n())
Dataset of datasets:
df <- data_frame(name=c("a", "b"), Data=list(iris, iris))
df2 <- df %>%
mutate(Data2=map(Data, ~group_by(., Species) %>%
mutate(nrow=n()) %>%
ungroup()))
but now n() returned 2?
df2[1,]$Data2[[1]]
If you define the function outside of mutate it works fine (I assume this output is what you have in mind...)
fun <- function(x) {
df <- group_by(x, Species) %>%
summarise(nrow = n())
}
df2 <- df %>%
mutate(Data2=map(Data, fun))
df2$Data2
# [[1]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
#
# [[2]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
Another option, available since version 0.7.0 is to use add_count(), which will not conflict with the map(), and anyway simplifies the code:
# standard case:
iris %>%
add_count(Species)
## df of df:
df2 <- df %>%
mutate(Data2=map(Data, ~add_count(., Species)))

How to parametrize function calls in dplyr 0.7?

The release of dplyr 0.7 includes a major overhaul of programming with dplyr. I read this document carefully, and I am trying to understand how it will impact my use of dplyr.
Here is a common idiom I use when building reporting and aggregation functions with dplyr:
my_report <- function(data, grouping_vars) {
data %>%
group_by_(.dots=grouping_vars) %>%
summarize(x_mean=mean(x), x_median=median(x), ...)
}
Here, grouping_vars is a vector of strings.
I like this idiom because I can pass in string vectors from other places, say a file or a Shiny app's reactive UI, but it's also not too bad for interactive work either.
However, in the new programming with dplyr vignette, I see no examples of how something like this can be done with the new dplyr. I only see examples of how passing strings is no longer the correct approach, and I have to use quosures instead.
I'm happy to adopt quosures, but how exactly do I get from strings to the quosures expected by dplyr here? It doesn't seem feasible to expect the entire R ecosystem to provide quosures to dplyr - lots of times we're going to get strings and they'll have to be converted.
Here is an example showing what you're now supposed to do, and how my old idiom doesn't work:
library(dplyr)
grouping_vars <- quo(am)
mtcars %>%
group_by(!!grouping_vars) %>%
summarise(mean_cyl=mean(cyl))
#> # A tibble: 2 × 2
#> am mean_cyl
#> <dbl> <dbl>
#> 1 0 6.947368
#> 2 1 5.076923
grouping_vars <- "am"
mtcars %>%
group_by(!!grouping_vars) %>%
summarise(mean_cyl=mean(cyl))
#> # A tibble: 1 × 2
#> `"am"` mean_cyl
#> <chr> <dbl>
#> 1 am 6.1875
dplyr will have a specialized group_by function group_by_at to deal with multiple grouping variables. It would be much easier to use the new member of the _at family:
# using the pre-release 0.6.0
cols <- c("am","gear")
mtcars %>%
group_by_at(.vars = cols) %>%
summarise(mean_cyl=mean(cyl))
# Source: local data frame [4 x 3]
# Groups: am [?]
#
# am gear mean_cyl
# <dbl> <dbl> <dbl>
# 1 0 3 7.466667
# 2 0 4 5.000000
# 3 1 4 4.500000
# 4 1 5 6.000000
The .vars argument accepts both character/numeric vector or column names generated by vars:
.vars
A list of columns generated by vars(), or a character vector of
column names, or a numeric vector of column positions.
Here's the quick and dirty reference I wrote for myself.
# install.packages("rlang")
library(tidyverse)
dat <- data.frame(cat = sample(LETTERS[1:2], 50, replace = TRUE),
cat2 = sample(LETTERS[3:4], 50, replace = TRUE),
value = rnorm(50))
Representing column names with strings
Convert strings to symbol objects using rlang::sym and rlang::syms.
summ_var <- "value"
group_vars <- c("cat", "cat2")
summ_sym <- rlang::sym(summ_var) # capture a single symbol
group_syms <- rlang::syms(group_vars) # creates list of symbols
dat %>%
group_by(!!!group_syms) %>% # splice list of symbols into a function call
summarize(summ = sum(!!summ_sym)) # slice single symbol into call
If you use !! or !!! outside of dplyr functions you will get an error.
The usage of rlang::sym and rlang::syms is identical inside functions.
summarize_by <- function(df, summ_var, group_vars) {
summ_sym <- rlang::sym(summ_var)
group_syms <- rlang::syms(group_vars)
df %>%
group_by(!!!group_syms) %>%
summarize(summ = sum(!!summ_sym))
}
We can then call summarize_by with string arguments.
summarize_by(dat, "value", c("cat", "cat2"))
Using non-standard evaluation for column/variable names
summ_quo <- quo(value) # capture a single variable for NSE
group_quos <- quos(cat, cat2) # capture list of variables for NSE
dat %>%
group_by(!!!group_quos) %>% # use !!! with both quos and rlang::syms
summarize(summ = sum(!!summ_quo)) # use !! both quo and rlang::sym
Inside functions use enquo rather than quo. quos is okay though!?
summarize_by <- function(df, summ_var, ...) {
summ_quo <- enquo(summ_var) # can only capture a single value!
group_quos <- quos(...) # captures multiple values, also inside functions!?
df %>%
group_by(!!!group_quos) %>%
summarize(summ = sum(!!summ_quo))
}
And then our function call is
summarize_by(dat, value, cat, cat2)
If you want to group by possibly more than one column, you can use quos
grouping_vars <- quos(am, gear)
mtcars %>%
group_by(!!!grouping_vars) %>%
summarise(mean_cyl=mean(cyl))
# am gear mean_cyl
# <dbl> <dbl> <dbl>
# 1 0 3 7.466667
# 2 0 4 5.000000
# 3 1 4 4.500000
# 4 1 5 6.000000
Right now, it doesn't seem like there's a great way to turn strings into quos. Here's one way that does work though
cols <- c("am","gear")
grouping_vars <- rlang::parse_quosures(paste(cols, collapse=";"))
mtcars %>%
group_by(!!!grouping_vars) %>%
summarise(mean_cyl=mean(cyl))
# am gear mean_cyl
# <dbl> <dbl> <dbl>
# 1 0 3 7.466667
# 2 0 4 5.000000
# 3 1 4 4.500000
# 4 1 5 6.000000

Resources