find grouping variable(s) from within a function called by do()

I am trying to call a plotting function on subgroups of a data.frame with dplyr::do(), producing one figure (a ggplot object) per subgroup. I want the title of each figure to be based on the grouping variable, so my function needs to know what the grouping variable is.
Currently, what gets passed to do() as . is an object of class tbl_df and data.frame. Without explicitly passing the grouping variable as a separate argument, is there a way to inspect the data.frame directly to learn what the grouping variable(s) is/are?
The solutions posted here call for explicitly passing each of the grouping variables as an additional argument to the function. I'm wondering if there is a more elegant and general solution that scales to varying numbers of grouping variables. While in this specific instance I'm interested in plotting, there are other use cases where I want to know how the subgroups are defined from within the function called on each subgroup.
I don't want to guess by looking for columns where length(unique(col)) == 1, because that is going to lead to lots of false positives with my data.
Is there an elegant way to do this?
Here is some sample code to get started.
library(dplyr)
library(ggplot2)
my_plot <- function(df) {
  subgroup_name <- "" # ?? (should come from the grouping variable)
  ggplot(df, aes(cty, hwy)) + geom_point() +
    ggtitle(subgroup_name)
}
mpg %>%
  group_by(manufacturer) %>%
  do(my_plots = my_plot(.))

I don't think it's possible to do this without passing the names of the grouping variable(s) into the function (I think the grouping variable "vars" attribute is lost when the grouped_df data.frame is split, before do() executes). Here's an alternative solution that requires defining the grouping variable(s) in a vector before applying the dplyr group_by %>% do chain:
library(ggplot2)
library(dplyr)
my_plot <- function(df, group_vars) {
  # get plot name from value(s) in grouping variable(s)
  subgroup_name <- paste(df[1, group_vars], collapse = " ")
  ggplot(data = df, aes(cty, hwy)) + geom_point() + ggtitle(subgroup_name)
}
group1 <- "manufacturer"
plots1 <-
  mpg %>%
  group_by_(.dots = group1) %>%
  do(my_plots = my_plot(., group1))
plots1$my_plots[1]
group2 <- c("manufacturer", "year")
plots2 <-
  mpg %>%
  group_by_(.dots = group2) %>%
  do(my_plots = my_plot(., group2))
plots2$my_plots[2]
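For readers on newer dplyr versions, one way to avoid passing the grouping names by hand is group_map(), which calls the function with the group's rows as its first argument and a one-row tibble of grouping values as its second. This is a minimal sketch assuming dplyr 0.8.0 or later; group_label() and the two-argument my_plot() are illustrative names, not part of the answer above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: collapse the one-row group key into a single title string
group_label <- function(key) {
  paste(unlist(lapply(key, as.character)), collapse = " ")
}

# `df` holds the rows of one group, `key` the grouping values for that group
my_plot <- function(df, key) {
  ggplot(df, aes(cty, hwy)) + geom_point() + ggtitle(group_label(key))
}

# group_map() returns a plain list with one ggplot object per group
plots <- mpg %>%
  group_by(manufacturer, year) %>%
  group_map(~ my_plot(.x, .y))
```

Because the key always contains every grouping column, the same my_plot() works unchanged for one grouping variable or several.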

Related

lapply strings to subset via brackets[]

There is almost certainly an easier way to go about this, but perhaps I've just been awake too long. I want to use the following vector of strings:
lap_list <- paste0(seq(1,length(mpg[[1]]),10), ":", seq(10,length(mpg[[1]]),10))
and use the vector to subset such as mpg[lap_list[1], ]. Alternatively, I could use dplyr for something with slice:
mpg %>%
slice(lap_list[1])
Both methods are giving the same error, and beyond parse(eval()) or as.numeric() I'm having a hard time wording my question for google.
The ultimate goal is to have a function such that I could lapply the graph outputs. Say:
barchart <- function(data_slice) {
  mpg %>%
    slice(data_slice) %>%
    ggplot(aes(x=model)) + geom_bar()
}
lapply(lap_list, barchart)

If you build the row ranges as strings with paste0, you have little option other than to use eval(parse()) in one way or another.
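A minimal sketch of what that looks like, assuming strings of the form "start:end" as in lap_list. The helper parse_range() is hypothetical and not part of any package; it is included as one possible way to avoid evaluating text as code by splitting the string instead. mtcars stands in for the data only to keep the sketch free of extra packages.

```r
# Option 1: evaluate the string as R code (works, but runs whatever the string contains)
rows <- eval(parse(text = "1:10"))

# Option 2 (hypothetical helper): split "start:end" and build the sequence directly
parse_range <- function(s) {
  bounds <- as.integer(strsplit(s, ":", fixed = TRUE)[[1]])
  seq(bounds[1], bounds[2])
}
rows2 <- parse_range("11:20")

# Either result can be used for ordinary row subsetting
first_ten <- mtcars[rows, ]
```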
An alternative is to create the start and end rows of each slice and store them in vectors. Pass them to Map() to slice the data and then plot.
library(dplyr)
library(ggplot2)
n <- nrow(mpg)
start <- seq(1,n,10)
#Added an extra `n` here to make the length of start and end equal
end <- c(seq(10,n,10), n)
barchart <- function(data, start, end) {
  data %>%
    slice(start:end) %>%
    ggplot(aes(x=model)) + geom_bar()
}
list_of_plots <- Map(barchart, start, end, MoreArgs = list(data = mpg))
You can access each individual plot using list_of_plots[[1]], list_of_plots[[2]], etc.
Alternatively, you can create groups of 10 rows and store the plots in the data frame:
mpg %>%
  group_by(grp = ceiling(row_number()/10)) %>%
  summarise(plot = list(ggplot(cur_data(), aes(x=model)) + geom_bar()))

With dplyr and enquo my code works but not when I pass to purrr::map

I want to create a plot for each column named in a vector called dates. My data frame contains only these columns, and for each one I want to group on it, count the occurrences and then plot the counts.
The code below works, except for the map call, which I want to use to go across a previously unknown number of columns. I think I'm using map correctly; I've had success with it before. I'm new to using quosures, but given that my direct function call works I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
stringsAsFactors = FALSE
)
dates<-names(df)
library(tidyverse)
dates.count<-function(.x){
group_by<-enquo(.x)
df %>% group_by(!!group_by) %>% summarise(count=n()) %>% ungroup() %>% ggplot() + geom_point(aes(y=count,x=!!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown

When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
  group_by = ensym(.x)
  df %>%
    group_by(!!group_by) %>%
    summarise(count=n()) %>%
    ungroup() %>%
    ggplot() +
    geom_point(aes(y=count,x=!!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always do an anonymous function instead, but in your case where you want to map the column names to a function with a single argument you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))
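As a hedged alternative for string column names, the .data pronoun (from rlang, usable inside dplyr verbs) can look a column up by name, which sidesteps ensym() and the extra !!. The function name count_by() and its data argument are assumptions for illustration, and the plotting step is left out to keep the sketch small.

```r
library(dplyr)

df <- data.frame(
  date1 = c("2018-01-01", "2018-01-01", "2018-01-01",
            "2018-01-02", "2018-01-02", "2018-01-02"),
  date2 = c("2018-01-01", "2018-01-01", "2018-01-01",
            "2018-01-02", "2018-01-02", "2018-01-02"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `col` is a column name given as a string
count_by <- function(data, col) {
  data %>%
    group_by(.data[[col]]) %>%
    summarise(count = n()) %>%
    ungroup()
}

# One summary table per column name; each name is passed as a plain string
counts <- lapply(names(df), count_by, data = df)
```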

How do pipes work with purrr map() function and the "." (dot) symbol

When using both pipes and the map() function from purrr, I am confused about how data and variables are passed along. For instance, this code works as I expect:
library(tidyverse)
cars %>%
select_if(is.numeric) %>%
map(~hist(.))
Yet, when I try something similar using ggplot, it behaves in a strange way.
cars %>%
select_if(is.numeric) %>%
map(~ggplot(cars, aes(.)) + geom_histogram())
I'm guessing this is because the "." in this case is passing a vector to aes(), which is expecting a column name. Either way, I wish I could pass each numeric column to a ggplot function using pipes and map(). Thanks in advance!

cars %>%
  select_if(is.numeric) %>%
  map2(., names(.),
       ~{ggplot(tibble(var = .x), aes(var)) +
           geom_histogram() +
           labs(x = .y) })
# Alternate version
cars %>%
  select_if(is.numeric) %>%
  imap(.,
       ~{ggplot(tibble(var = .x), aes(var)) +
           geom_histogram() +
           labs(x = .y) })
There are a few extra steps.
Use map2 instead of map. The first argument is the dataframe you're passing it, and the second argument is a vector of the names of that dataframe, so it knows what to map over. (Alternatively, imap(x, ...) is a synonym for map2(x, names(x), ...). It's an "index-map", hence "imap".)
You then need to explicitly wrap each column in a data frame (here with tibble(), which replaces the deprecated data_frame()), since ggplot only works on dataframes and coercible objects.
This also gives you access to the .y pronoun to label the plots.

You aren't supposed to pass raw data to an aesthetic mapping. Instead, you should dynamically build the data.frame. For example:
cars %>%
  select_if(is.numeric) %>%
  map(~ggplot(tibble(x=.), aes(x)) + geom_histogram())

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
  my_data %>%
  group_by(my_grouping_variable) %>%
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
                    unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.

First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function (nest() comes from tidyr):
your_function <- function(x) {
  if(inherits(x, "grouped_df")) x <- nest(x)
  x
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...

You can use groups(); however, a standard-evaluation (SE) version of this does not exist, so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
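For programmatic use, two related dplyr functions may be worth a look. This is a sketch under the assumption of dplyr 1.0.0 or later for cur_group(): group_vars() returns the grouping column names as a character vector, and cur_group() returns the current group's values as a one-row tibble from inside a verb such as summarise().

```r
library(dplyr)

df <- mtcars %>% group_by(cyl, mpg)

# Grouping column names as plain strings, convenient to pass around in functions
vars <- group_vars(df)

# Grouping value of the current group, read inside summarise()
labels <- mtcars %>%
  group_by(cyl) %>%
  summarise(label = paste("cyl =", cur_group()$cyl))
```

Here vars is c("cyl", "mpg"), and labels has one row per value of cyl with a label built from that group's key.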

Repeatedly mutate variable using dplyr and purrr

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_a
My custom function
calculate_delta <- function(df, col) {
  #generate variable name
  newcolname = paste("d", col, sep="")
  #get formula for first difference.
  calculate_diff <- lazyeval::interp(~(a + lag(a))/a, a = as.name(col))
  #pass formula to mutate, name new variable the columname generated above
  df %>%
    mutate_(.dots = setNames(list(calculate_diff), newcolname))
}
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although obviously the meaning of the result is non-sensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious whether there is a method of repeatedly transforming variables that accomplishes the following.
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as is the case when using dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes what I've described by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.

Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) /x
cols <- c("wt", "mpg")
#This
library(dplyr)
mtcars %>% mutate_at(cols, delta)
#Or
library(purrr)
mtcars %>% map_at(cols, delta)
#If necessary, in a function
f <- function(df, cols) {
df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
Edit
If you would like to embed new names after, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
  names(object)[names(object) %in% old] <- new
  object
}
mtcars %>%
  mutate_at(cols, delta) %>%
  Rename(cols, paste0("lagged",cols))
If you want to rename the resulting lagged variables:
mtcars %>% mutate_at(cols, funs(lagged = delta))
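On newer dplyr versions, the same idea can be written with across(), which keeps the original columns and names the new ones in a single step. This is a minimal sketch assuming dplyr 1.0.0 or later; the "d" prefix mirrors the naming used in the question's calculate_delta().

```r
library(dplyr)

delta <- function(x) (x + dplyr::lag(x)) / x
cols <- c("wt", "mpg")

# .names builds each new column name from the source column name ({.col}),
# so wt and mpg are kept and dwt and dmpg are added alongside them
res <- mtcars %>%
  mutate(across(all_of(cols), delta, .names = "d{.col}"))
```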
