There is a data.frame like so:
df <- data.frame("Config" = c("C1","C1","C2","C2"), "SN1" = 1:4, "SN2" = 5:8)
I'm trying to make group_by %>% summarise more generic. Here is an example that does not work:
variable <- "SN1"
df %>%
group_by(
Config
) %>%
summarise(
paste0(variable, ".median")=median(UQ(as.symbol(variable)))
) %>%
as.data.frame() ->
df_summary
It does not work because of paste0(variable, ".median") part.
The related question "Pass arguments to dplyr functions" helped me parameterize the median(UQ(as.symbol(variable))) part, but it does not cover the left-hand side.
Is there a way to fix the above?
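You can convert the string to a symbol with sym() and unquote it with !!. On the left-hand side, unquote the pasted name and assign with := instead of =:
library(tidyverse)
mysumm <- function(variable){
  var <- sym(variable)  # turn the column name (a string) into a symbol
  df %>%
    group_by(Config) %>%
    summarise(!!paste0(variable, ".median") := median(!!var))
}
mysumm('SN1')
# # A tibble: 2 x 2
#   Config SN1.median
#   <fct>       <dbl>
# 1 C1            1.5
# 2 C2            3.5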
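For readers on a recent dplyr (1.0 or later, which is an assumption about your setup), the same idea can be sketched with a glue-style name on the left of := and the .data pronoun on the right. The helper name median_by_config is made up for illustration.

```r
library(dplyr)

df <- data.frame(Config = c("C1", "C1", "C2", "C2"), SN1 = 1:4, SN2 = 5:8)

# Hypothetical helper: `variable` is a column name passed as a string.
median_by_config <- function(data, variable) {
  data %>%
    group_by(Config) %>%
    # "{variable}.median" builds the output name; .data[[variable]] looks up the column
    summarise("{variable}.median" := median(.data[[variable]]))
}

df_summary <- as.data.frame(median_by_config(df, "SN1"))
df_summary
```

Calling median_by_config(df, "SN2") should work the same way, producing a column named SN2.median.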
Related
I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))
We can use the OP's function in group_modify
library(dplyr)
MyDF %>%
group_by(Fruit) %>%
group_modify(~ .x %>%
summarise(MyVal = myFun(.x))) %>%
ungroup
-output
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or in group_map where the .y is the grouping column
MyDF %>%
group_by(Fruit) %>%
group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
bind_rows
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
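Since the question mentions purrr, here is a rough sketch of another route: nest the non-grouping columns into a list-column and map the function over it. It assumes tidyr 1.0 or later for the nest(data = ...) syntax, and that myFun returns a single integer per group, as it does for this data.

```r
library(dplyr)
library(tidyr)
library(purrr)

myFun <- function(DF){
  DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}

MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))

res <- MyDF %>%
  nest(data = c(A, B)) %>%               # one row per Fruit, with a list-column of data frames
  mutate(MyVal = map_int(data, myFun)) %>%  # apply myFun to each group's data frame
  select(-data)

res
```

If myFun returned a double rather than an integer, map_dbl would be the variant to use instead of map_int.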
I am trying to write a function that indexes variable names.
In particular, in my function, I use mutate to encode a variable that I have without changing its name. Does anyone know how I can index a variable on the left-hand side of mutate?
Here is an example
library(tidyverse)
# first create relevant dataset
iris <- iris%>% group_by(Species) %>% mutate(mean_Length=mean(Sepal.Length))
# second create my function
userfunction <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(get(var)= # this is what causes my function to fail. How can i refer to the `var` here?
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
# this function produces the following error # Error: unexpected '}' in "}"
#note that if I change the reference to its original string the function works
userfunction2 <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(Species= # without reference it works, but I am unable to use the function for multiple variables.
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
encodedata<- userfunction2("Species")
Thanks a lot in advance for your help
Best
Here is a working example that goes in a similar direction to Limey's answer:
iris <- datasets::iris %>%
group_by(Species) %>%
mutate(mean_Length=mean(Sepal.Length)) %>%
ungroup()
userfunction <- function(var){
iris %>%
transmute(mean_Length, "temp" = iris[[var]]) %>%
distinct() %>%
mutate("{var}" := factor(temp)) %>%
arrange(temp) %>%
select(-temp)
}
userfunction("Petal.Length")
I don't think var is your problem. I think it's the =. If you have an enquoted variable on the left-hand side of the assignment (which is effectively what you have with get()), you need :=, not =.
See here for more details.
I would have written your function slightly differently:
userfunction <- function(data, var){
qVar <- enquo(var)
newdata <- data %>%
select(mean_Length, !! qVar) %>% distinct() %>%
mutate(!! qVar := factor(!! qVar, !! qVar)) %>%
arrange(!! qVar)
return(newdata)
}
The inclusion of the data parameter means you can include it in a pipe:
encodedata <- iris %>% userfunction(Species)
encodedata
# A tibble: 3 x 2
# Groups: Species [3]
mean_Length Species
<dbl> <fct>
1 5.01 setosa
2 5.94 versicolor
3 6.59 virginica
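As a variation, a minimal sketch with the embrace operator {{ }} (assuming rlang 0.4.3 or later and dplyr 1.0 or later) avoids the separate enquo step. The names iris_means and encode_column are invented for this example, and levels = unique(...) is my own choice to keep the levels in order of appearance.

```r
library(dplyr)

iris_means <- datasets::iris %>%
  group_by(Species) %>%
  mutate(mean_Length = mean(Sepal.Length)) %>%
  ungroup()

# Hypothetical helper: `var` is passed unquoted, e.g. encode_column(iris_means, Species)
encode_column <- function(data, var) {
  data %>%
    distinct(mean_Length, {{ var }}) %>%
    # "{{ var }}" on the left of := reuses the column's own name
    mutate("{{ var }}" := factor({{ var }}, levels = unique({{ var }}))) %>%
    arrange({{ var }})
}

encoded <- encode_column(iris_means, Species)
encoded
```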
On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") #two column names I want to be able to toggle between for grouping
select.column <- column.names[1] #Select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
Well, this is just a copy-paste from the tidyverse website (https://dplyr.tidyverse.org/articles/programming.html#programming-recipes):
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think it illustrates your problem. I think what you really want is something like the code above, i.e. to create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.colums <- c("group1", "group2")
x %>% group_by_(select.colums[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family of functions in dplyr might also offer the more general solution you are after, although the dplyr documentation says they are deprecated (?group_by_) and might disappear at some point. An analogous expression to the above solution using the tidy evaluation syntax seems to be:
x %>% group_by(!!sym(select.colums[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.colums)) %>% summarize(avg = mean(value))
This creates a symbol out of a string that is evaluated by dplyr.
I recommend using group_by_at(). It supports both single strings and character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
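On dplyr 1.0 or later (an assumption about the installed version), the scoped *_at verbs are superseded, and the same grouping can be sketched with across() and all_of(). The avg_mpg summary is only there to make the result visible.

```r
library(dplyr)

nms <- c("cyl", "am")

by_cols <- mtcars %>%
  group_by(across(all_of(nms))) %>%   # group by the columns named in the character vector
  summarise(avg_mpg = mean(mpg), .groups = "drop")

by_cols
```

all_of() raises an error if any name in nms is missing from the data, which tends to catch typos early.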
I am trying to refactor my R code (shown below) into Sparklyr R code to work on a spark dataset to get to the final result as shown in Table 1:
With help from the Stack Overflow posts "Gather in sparklyr" and "SparklyR separate one Spark Data Frame column into two columns", I got everything working except the last step, the spread.
Need Help:
Implement Spread via SparklyR
Optimize code in any way
Table 1: Final output needed:
var n nmiss
1 Sepal.Length 150 0
2 Sepal.Width 150 0
R code to achieve it:
library(dplyr)
library(tidyr)
library(tibble)
data <- iris
data_tbl <- as_tibble(data)
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width) %>%
summarize_all(funs(
n = n(), #Count
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
spread(stat, value)
Spark Code:
sdf_gather <- function(tbl){
all_cols <- colnames(tbl)
lapply(all_cols, function(col_nm){
tbl %>%
select(col_nm) %>%
mutate(key = col_nm) %>%
rename(value = col_nm)
}) %>%
sdf_bind_rows() %>%
select(c('key', 'value'))
}
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width ) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.)))
)) %>%
sdf_gather(.) %>%
ft_regex_tokenizer(input_col="key", output_col="KeySplit", pattern="_(?=[^_]*$)") %>%
sdf_separate_column("KeySplit", into=c("var", "stat")) %>%
select(var,stat,value) %>%
sdf_register('profile')
In this specific case (and in general whenever all columns have the same type, although this can be relaxed further if you are only interested in missing-data statistics) you can use a much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather the columns (below I assume a function as defined in my answer to Gather in sparklyr)
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given the reduced size of the output, it is also fine to collect the result after aggregation
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather - spread logic on the result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact the latter approach should be superior performance-wise in this particular case.
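For the local, already-collected step, here is a minimal sketch using newer tidyverse verbs, since funs() is deprecated and gather()/spread() are superseded. It runs on the plain iris data frame rather than a Spark table, so the column names keep their dots; treat it as an illustration of the reshaping only, not of anything Spark-specific.

```r
library(dplyr)
library(tidyr)

# Aggregate locally: one column per (variable, statistic) pair, e.g. Sepal.Length_n
agg <- iris %>%
  summarise(across(c(Sepal.Length, Sepal.Width),
                   list(n = ~ length(.x), nmiss = ~ sum(is.na(.x)))))

# Split names at the last underscore: the first part becomes `var`,
# the second part names the output columns (n, nmiss)
profile <- agg %>%
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_pattern = "(.*)_([^_]*)$")

profile
```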
I need to create a function that could group_by and summarise a data frame using the names of its columns.
I'm working with dplyr version 0.4.1 (and I cannot update), so it looks like the solutions I've found in other topics don't work...
Here is my example :
data <- data.frame(section=rep(c("A","B"),3), quantity=c(6:11))
#I need to get this result :
RESULT = data %>% group_by(section) %>% summarise(total=sum(quantity))
I implemented this function, but I got an error :
# function :
synthetize = function(x,column,measure){
result = x %>% group_by(column) %>% summarise(total=sum(measure))
}
RESULT2=synthetize(data,column="section",measure="quantity")
RESULT2
I tried eval, get, but it looks like this doesn't help
We can convert the string to symbol with rlang::sym and evaluate (!!)
library(tidyverse)
synthetize = function(x, column, measure){
x %>%
group_by_at(column) %>%
summarise(total=sum(!! rlang::sym(measure)))
}
synthetize(data, column="section", measure="quantity")
# A tibble: 2 x 2
# section total
# <fct> <int>
#1 A 24
#2 B 27
NOTE: Here we use the OP's same argument type
If we are using an older version of dplyr, maybe the following would help
library(lazyeval)
synthetize2 = function(x, column, measure){
  x %>%
    group_by_(column) %>%
    # summarise_() is the standard-evaluation verb that accepts the interpolated formula
    summarise_(total = interp(~ sum(v1), v1 = as.name(measure)))
}
synthetize2(data, column='section', measure='quantity')
Another way is with enquo:
library(tidyverse)
synthetize = function(x,column,measure) {
result = x %>% group_by(!! enquo(column)) %>% summarise(total := sum(!! enquo(measure)))
}
In this case, you wouldn't need to quote the variables:
RESULT2 = synthetize(data, column = section, measure = quantity)
RESULT2
# A tibble: 2 x 2
section total
<fct> <int>
1 A 24
2 B 27
If you don't have access to the newest tidyverse, try with get (using plain =, since := needs a tidy-evaluation version of dplyr):
library(dplyr)
synthetize = function(x,column,measure) {
result = x %>% group_by(get(column)) %>% summarise(total = sum(get(measure)))
}
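For completeness, on a current dplyr (1.0 or later, so not the 0.4.1 from the question) string column names can also be handled with the .data pronoun. This is only a sketch, and synthetize_str is a made-up name.

```r
library(dplyr)

data <- data.frame(section = rep(c("A", "B"), 3), quantity = 6:11)

# Hypothetical helper: both `column` and `measure` are column names given as strings.
synthetize_str <- function(x, column, measure) {
  x %>%
    group_by(.data[[column]]) %>%
    summarise(total = sum(.data[[measure]]))
}

res <- synthetize_str(data, column = "section", measure = "quantity")
res
```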