How can I make a function that takes a column and uses it in dplyr, tidyr and ggplot?
df <- data.frame(date_col = c(1, 1, 2, 2, 3, 4, 4, 5, 5),
                 col_a = c('a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'),
                 val_col = runif(9))
How do I write a function that takes a parameter param_col instead of the hardcoded col_a?
df %>%
  group_by(date_col, col_a) %>%
  summarise(val_col = sum(val_col)) %>%
  complete(col_a, date_col) %>%
  ggplot(aes(date_col, val_col, color = col_a)) +
  geom_line()
The dplyr and ggplot calls work in the code outlined below, but how should the complete() call be written? Or should complete_() be used?
Is there a more canonical way of doing this?
plot_nice_chart <- function(df, param_col) {
  enq_param_col <- enquo(param_col)
  str_param_col <- deparse(substitute(param_col))
  # aggregate the data based on param_col,
  # explicitly fill in NAs for missing combinations to avoid interpolation
  df %>%
    group_by(!!enq_param_col, date_col) %>%
    summarise(val_col = sum(val_col)) %>%
    complete(<what-should-be-here?>, date_col) %>%
    ggplot(aes_string("date_col", "val_col", color = str_param_col)) +
    geom_line()
}
The development version of tidyr (0.6.3.9000) now uses tidy evaluation, so if you are willing to update to it you can use !! in complete() just as you did in group_by().
plot_nice_chart <- function(df, param_col) {
  enq_param_col <- enquo(param_col)
  str_param_col <- deparse(substitute(param_col))
  df %>%
    group_by(!!enq_param_col, date_col) %>%
    summarise(val_col = sum(val_col)) %>%
    ungroup() %>%
    complete(!!enq_param_col, date_col) %>%
    ggplot(aes_string("date_col", "val_col", color = str_param_col)) +
    geom_line()
}
Using the current CRAN version, you can use complete_() with the variables passed as strings.
plot_nice_chart <- function(df, param_col) {
  enq_param_col <- enquo(param_col)
  str_param_col <- deparse(substitute(param_col))
  df %>%
    group_by(!!enq_param_col, date_col) %>%
    summarise(val_col = sum(val_col)) %>%
    ungroup() %>%
    complete_(c(str_param_col, "date_col")) %>%
    ggplot(aes_string("date_col", "val_col", color = str_param_col)) +
    geom_line()
}
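As an aside that is not part of the original answers: with more recent versions of these packages (roughly rlang >= 0.4, tidyr >= 1.0 and ggplot2 >= 3.0), the curly-curly operator {{ }} replaces the enquo()/!! pair and also works inside aes(), so a sketch of the same function could look like this:

# A sketch using {{ }}; assumes current rlang, tidyr and ggplot2
plot_nice_chart <- function(df, param_col) {
  df %>%
    group_by({{ param_col }}, date_col) %>%
    summarise(val_col = sum(val_col)) %>%
    ungroup() %>%
    complete({{ param_col }}, date_col) %>%
    ggplot(aes(date_col, val_col, color = {{ param_col }})) +
    geom_line()
}

plot_nice_chart(df, col_a)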
Related
I have a data frame with the scores of students for a few subjects (each subject represented by a column). I want to do the calculation below for each subject (Math, Science and Reading):
avgdata_math <- data %>%
  group_by(country) %>%
  summarise(ci = list(bootstrap_ci(sex, Math, weight))) %>%
  unnest_wider(ci) %>%
  ungroup() %>%
  mutate(country = fct_reorder(country, avg))
Since I have to repeat the same code twice, I want to write a function to do the calculation (without pivoting the data frame):
aus_nz <- function(df, subject = "Math") {
  df %>%
    group_by(country) %>%
    summarise(ci = list(bootstrap_ci(sex, subject, weight))) %>%
    unnest_wider(ci) %>%
    ungroup() %>%
    mutate(country = fct_reorder(country, avg))
}
This gives me an error, since I've passed the column name (subject) as a string: after grouping the data, the bootstrap_ci() call receives that string, whereas it should receive the column of data (the per-group values).
Using !! rlang::ensym(subject) in your function should work.
aus_nz <- function(df, subject = "Math") {
  df %>%
    group_by(country) %>%
    summarise(ci = list(bootstrap_ci(sex, !!rlang::ensym(subject), weight))) %>%
    unnest_wider(ci) %>%
    ungroup() %>%
    mutate(country = fct_reorder(country, avg))
}
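Since rlang::ensym() accepts either a bare column name or a string, both of these calls should work (assuming your data frame also has a Science column):

aus_nz(data, "Math")
aus_nz(data, Science)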
Update
If you also want to pass the grouping variable as a string into the function, and you sometimes have more than one variable to group by, then !!!, rlang::ensyms() and the ellipsis (...) argument would do the trick, were it not for the last line of your function: fct_reorder() expects a single variable. In the case of two grouping variables, what would you do? Create two new variables and reorder each grouping variable by avg? It would also be helpful to see your data (maybe with dput(head(...))).
aus_nz <- function(df, subject = "Math", ...) {
  group_var <- rlang::ensyms(...)
  df %>%
    group_by(!!!group_var) %>%
    summarise(ci = list(bootstrap_ci(sex, !!rlang::ensym(subject), weight))) %>%
    unnest_wider(ci) %>%
    ungroup() # %>% last line needs to be fixed
    # mutate(grouped_by = fct_reorder(!!!group_var, avg))
}
If you do not want to use the ellipsis argument, you can use rlang::syms() and a character vector (with one or multiple elements) instead:
aus_nz <- function(df, subject = "Math", group = "country") {
  group_var <- rlang::syms(group)
  df %>%
    group_by(!!!group_var) %>%
    summarise(ci = list(bootstrap_ci(sex, !!rlang::ensym(subject), weight))) %>%
    unnest_wider(ci) %>%
    ungroup() # %>% last line needs to be fixed
    # mutate(grouped_by = fct_reorder(!!!group_var, avg))
}
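A hedged alternative sketch, not from the answer above: if subject and the grouping columns always arrive as strings, the .data pronoun and all_of() avoid the symbol conversion entirely. This assumes dplyr >= 1.0; bootstrap_ci() is the asker's own helper and is not defined here, and the fct_reorder() line is left out for the reason given above.

aus_nz <- function(df, subject = "Math", group = "country") {
  df %>%
    group_by(across(all_of(group))) %>%                                 # group is a character vector
    summarise(ci = list(bootstrap_ci(sex, .data[[subject]], weight))) %>%
    unnest_wider(ci) %>%
    ungroup()
}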
I'm trying to dynamically create an extra column. The first piece of code works as I want it to:
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(animals = sample(c('dog', 'cat', 'rat'), 100, replace = TRUE))

my_fun <- function(data, column_name){
  data %>%
    group_by(animals) %>%
    summarise(!!column_name := n())
}

my_fun(df, 'frequency')
Here I also use the complete() function, and it doesn't work:
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(animals = sample(c('dog', 'cat', 'rat'), 100, replace = TRUE))

my_fun <- function(data, column_name){
  data %>%
    group_by(animals) %>%
    summarise(!!column_name := n()) %>%
    ungroup() %>%
    complete(animals = c('dog', 'cat', 'rat', 'bat'),
             fill = list(!!column_name := 0))
}

my_fun(df, 'frequency')
The list() function doesn't seem to like !!column_name :=. Is there something I can do to make this work? Basically I want the second piece of code to output:
animals frequency
bat 0
cat 38
dog 27
rat 35
You could keep the fill argument of complete() as the default (which will give you the missing values as NA) and subsequently replace them with 0:
my_fun <- function(data, column_name){
  data %>%
    group_by(animals) %>%
    summarise(!!column_name := n()) %>%
    ungroup() %>%
    complete(animals = c('dog', 'cat', 'rat', 'bat')) %>%
    mutate_all(~replace(., is.na(.), 0))
}
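Alternatively, if you do want to build the fill list programmatically, rlang::list2() understands !! and := the same way summarise() does. A hedged sketch, not from the answer above:

my_fun <- function(data, column_name){
  data %>%
    group_by(animals) %>%
    summarise(!!column_name := n()) %>%
    ungroup() %>%
    complete(animals = c('dog', 'cat', 'rat', 'bat'),
             fill = rlang::list2(!!column_name := 0))
             # base-R equivalent: fill = setNames(list(0), column_name)
}

my_fun(df, 'frequency')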
I have the following script. Option 1 uses a long format and group_by to identify the first step of many where the status equals 0.
Another option (2) is to use apply to calculate this value for each row, and then transform the data to a long format.
The first option does not scale well. The second does, but I was unable to get it into a dplyr pipe. I tried to solve this with purrr but did not succeed.
Questions:
Why does the first option not scale well?
How can I transform the second option into a dplyr pipe?
require(dplyr)
require(tidyr)
require(ggplot2)

set.seed(314)

# example data
dat <- as.data.frame(matrix(sample(c(0, 1),
                                   size = 9000000,
                                   replace = TRUE,
                                   prob = c(5, 95)),
                            ncol = 9))
names(dat) <- paste("step", 1:9, sep = "_")
steps <- dat %>% select(starts_with("step_")) %>% names()
# option 1 is slow
dat.cum <- dat %>%
  mutate(id = row_number()) %>%
  gather(step, status, -id) %>%
  group_by(id) %>%
  mutate(drop = min(if_else(status == 0, match(step, steps), 99L))) %>%
  mutate(status = if_else(match(step, steps) >= drop, 0, 1))

ggplot(dat.cum, aes(x = step, fill = factor(status))) +
  geom_bar()
# option 2 is faster
dat$drop <- apply(dat, 1, function(x) min(which(x == 0), 99))

dat.cum <- dat %>%
  gather(step, status, -drop) %>%
  mutate(status = if_else(match(step, steps) >= drop, 0, 1))

ggplot(dat.cum, aes(x = step, fill = factor(status))) +
  geom_bar()
If you would like to map along rows you could do:
dat %>%
  mutate(drop2 = map_int(seq_len(nrow(dat)), ~ min(which(dat[.x, ] == 0L), 99L)))
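A hedged sketch of how that plugs into the rest of option 2, assuming dat holds the step_ columns created above and purrr is attached for map_int():

library(purrr)

dat.cum <- dat %>%
  mutate(drop = map_int(seq_len(nrow(dat)),
                        ~ min(which(dat[.x, ] == 0L), 99L))) %>%
  gather(step, status, -drop) %>%
  mutate(status = if_else(match(step, steps) >= drop, 0, 1))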
It could be that "gathering and grouping" is faster than looping:
dat %>%
  as_tibble() %>%
  select(starts_with("step_")) %>%
  mutate(row_nr = row_number()) %>%
  gather(key = "col", value = "value", -row_nr) %>%
  arrange(row_nr, col) %>%
  group_by(row_nr) %>%
  mutate(col_index = row_number()) %>%
  filter(value == 0) %>%
  summarise(drop3 = min(col_index)) %>%
  ungroup() %>%
  right_join(dat %>%
               mutate(row_nr = row_number()),
             by = "row_nr") %>%
  mutate(drop3 = if_else(is.na(drop3), 99, drop3))
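As a hedged aside that is not in the original answers: the first zero per row can also be found fully vectorised in base R, which avoids both the per-row loop and grouping by millions of row ids:

zero_mat <- dat[steps] == 0                                   # logical matrix, one column per step
first_zero <- max.col(zero_mat, ties.method = "first")        # position of the first TRUE in each row
dat$drop <- ifelse(rowSums(zero_mat) == 0, 99L, first_zero)   # rows with no zero get 99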
dplyr programming question here. I'm trying to write a dplyr function which takes column names as inputs and also filters on a minimum date set in the function. What I am trying to recreate is the following, called test:
# test df
x <- sample(1:100, 10)
y <- sample(c(TRUE, FALSE), 10, replace = TRUE)
date <- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
my_df <- data.frame(x = x, y = y, date = date)

test <- my_df %>%
  group_by(date) %>%
  summarise(total = n(), total_2 = sum(y == TRUE, na.rm = TRUE)) %>%
  mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
  ungroup() %>%
  filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- enquo(cumulative_y)
  data %>%
    group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(!!cumulative_y == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter((!!date_field) >= minimum_date)
}

test2 <- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo() and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function does not contain the data from the referenced logical column y.
I also tried this instead:
cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- deparse(substitute(cumulative_y))
  data %>%
    group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(data[[cumulative_y]] == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter((!!date_field) >= minimum_date)
}

test2 <- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output from my test2 is also wildly different; it seems to do some kind of recursive accumulation, which again is different from my test data frame.
If anyone can help that would be much appreciated.
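A hedged sketch, not taken from the thread: with rlang >= 0.4 the curly-curly operator lets the function body read exactly like the interactive pipeline, which should reproduce test:

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter({{ date_field }} >= minimum_date)
}

test2 <- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")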
Since I have to use a function in a loop, I have to use the dplyr group_by_at() and summarise_at() functions. Unfortunately, I am not able to use the complete() function (from tidyr) with a column index to prevent empty groups from being removed. Or is there another option to prevent dplyr from dropping empty groups?
library(dplyr)
library(plyr)
library(tidyr)   # complete() comes from tidyr

df1 <- mtcars %>%
  group_by(gear) %>%
  summarise(Mittelwert = mean(mpg, na.rm = TRUE)) %>%
  complete(gear, fill = list(Gewicht = 1))
df1

df2 <- mtcars %>%
  group_by_at(10) %>%
  summarise_at(1, mean, na.rm = TRUE) %>%
  complete(gear, fill = list(Gewicht = 1))
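One possible approach, sketched here as a suggestion rather than taken from the thread: complete() has no *_at variant, but the grouping column can be picked by index and spliced in as a symbol with rlang::sym(). In this mtcars reprex every gear value is present, so complete() adds no rows; the point is only how to pass the column by position.

library(dplyr)
library(tidyr)
library(rlang)

idx <- 10  # position of the grouping column (gear in mtcars)
df2 <- mtcars %>%
  group_by_at(idx) %>%
  summarise_at(1, mean, na.rm = TRUE) %>%
  complete(!!sym(names(mtcars)[idx]), fill = list(mpg = 1))  # mpg is the summarised column here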