Repeatedly mutate variable using dplyr and purrr - r

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_at.
My custom function
calculate_delta <- function(df, col) {
  # generate the new variable name
  newcolname <- paste("d", col, sep = "")
  # build the formula for the change calculation
  calculate_diff <- lazyeval::interp(~ (a + lag(a)) / a, a = as.name(col))
  # pass the formula to mutate_, naming the new variable with the column name generated above
  df %>%
    mutate_(.dots = setNames(list(calculate_diff), newcolname))
}
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although the meaning of the result is obviously nonsensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious whether there is a method of repeatedly transforming variables that accomplishes three things:
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as is the case when using dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes what I've described by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.

Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) / x
cols <- c("wt", "mpg")
#This
library(dplyr)
mtcars %>% mutate_at(cols, delta)
#Or
library(purrr)
mtcars %>% map_at(cols, delta)
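Note that purrr's map_at() returns a plain list rather than a data frame. If you want a tibble back, one option (a minimal sketch, assuming the cols and delta definitions above) is to reassemble it:
mtcars %>% map_at(cols, delta) %>% dplyr::as_tibble()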
#If necessary, in a function
f <- function(df, cols) {
  df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
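If you specifically want to keep the question's calculate_delta(df, col) signature and apply it across a vector of names, a hedged purrr alternative is reduce(), which threads the data frame through successive calls:
library(purrr)
vars <- c("wt", "mpg")
# calls calculate_delta(mtcars, "wt"), then feeds that result into calculate_delta(., "mpg")
result <- reduce(vars, calculate_delta, .init = mtcars)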
Edit
If you would like to apply new names afterwards, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
  names(object)[names(object) %in% old] <- new
  object
}
mtcars %>%
  mutate_at(cols, delta) %>%
  Rename(cols, paste0("lagged", cols))
Alternatively, if you want to keep the originals and add suffixed copies, use a named funs(); this creates wt_lagged and mpg_lagged alongside wt and mpg:
mtcars %>% mutate_at(cols, funs(lagged = delta))
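If you instead want the question's "d" prefix, a hedged follow-up sketch (assuming a dplyr version that has rename_at()):
mtcars %>%
  mutate_at(cols, funs(delta = delta)) %>% # creates wt_delta and mpg_delta alongside the originals
  rename_at(vars(ends_with("_delta")), ~ paste0("d", sub("_delta$", "", .))) # rename to dwt and dmpg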

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){
  aggdata <- data %>%
    group_by({{group}}) %>%
    select_if(function(col) !is.numeric(col) & !is.integer(col)) %>%
    summarise_if(is.numeric, sum, na.rm = TRUE)
  return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
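As a sketch of that idea (the wrapper name agg2 is hypothetical, and the body is simplified to the grouped summary):
library(dplyr)
agg2 <- function(data, groups) {
  data %>%
    group_by(!!!syms(groups)) %>% # splice the group names in as symbols
    summarise_if(is.numeric, sum, na.rm = TRUE)
}
agg2(mtcars, c("cyl", "gear"))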

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions I have been completely overlooking, instead manually creating lists. Anyone have any suggestions?
This can work using the iris dataset as example data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
Sepal.Length Sepal.Width
statistic 96.93744 63.57115
parameter 2 2
p.value 8.918734e-22 1.569282e-14
method "Kruskal-Wallis rank sum test" "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
R is a very diverse coding language, and there will ultimately be many ways to do the same thing. Some functions expect standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that expect a single vector as input, as opposed to functions that have a data argument, in which you use variable as opposed to data$variable.
I have a few sidebars before I give some advice
Sidebar 1 - S3 Methods
While this may be besides the point in regards to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument to the function. In this example you are passing a formula expression, for which the data argument is necessary, whereas the default method just requires x and g arguments (with which you could probably keep your original pipelines).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has different dispatch methods that will work for you.
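For instance, a minimal illustration of the two dispatch routes using iris:
# formula method: requires a data argument
kruskal.test(Sepal.Length ~ Species, data = iris)
# default method: takes the x and g vectors directly
kruskal.test(iris$Sepal.Length, g = iris$Species)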
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first, the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are evaluated within a certain data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], id = data[[id]], na.rm=T), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages: dplyr, tidyr, and purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before doing the sapply. With the dplyr package you can possibly circumvent this if there is a condition to filter on.
data %>%
  group_by(group) %>% # groups the data
  summarise_if(is.numeric, mean, na.rm = TRUE) # applies mean to every numeric column
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
  group_by(group) %>% # grouping to retain the column in the next select statement
  select_if(is.numeric) %>% # select all numeric columns
  pivot_longer(cols = -group) %>% # reshape all columns except "group"; names go to `name`, values to `value`
  group_by(name) %>% # regroup on the name variable (the old numeric columns)
  summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) # perform the test for each variable
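As a self-contained sketch of that pipeline using iris (with Species standing in for group), plus one extra hedged step to pull the p-values back out of the list column:
library(dplyr)
library(tidyr)
iris %>%
  group_by(Species) %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = -Species) %>%
  group_by(name) %>%
  summarise(krusk = list(kruskal.test(value ~ as.factor(Species)))) %>%
  mutate(p.value = sapply(krusk, function(k) k$p.value)) # each element of krusk is an htest object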
I've only mentioned purrr because you can almost drop-in replace all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type.
I hope this helps and wish you luck on your coding adventures.

With dplyr and enquo my code works but not when I pass to purrr::map

I want to create a plot for each column in a vector called dates. My data frame contains only these columns and I want to group on it, count the occurrences and then plot it.
The code below works, except for the map call, which I want to use to iterate across a previously unknown number of columns. I think I'm using map correctly; I've had success with it before. I'm new to using quosures, but given that my function call works, I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
  date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
  date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
  stringsAsFactors = FALSE
)
dates <- names(df)
library(tidyverse)
dates.count <- function(.x){
  group_by <- enquo(.x)
  df %>%
    group_by(!!group_by) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    ggplot() +
    geom_point(aes(y = count, x = !!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown
When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
  group_by <- ensym(.x)
  df %>%
    group_by(!!group_by) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    ggplot() +
    geom_point(aes(y = count, x = !!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always use an anonymous function instead, but in your case, where you want to map the column names to a function with a single argument, you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))

mutate_each_ non-standard evaluation

Really struggling with putting dplyr functions within my own functions. I understand the function_ suffix for the standard-evaluation versions, but I'm still having problems, and have seemingly tried every combination of eval, paste, and lazy.
Trying to divide multiple columns by the median of the control for a group. Example data includes an additional column in iris named 'Control', so each species has 40 'normal', and 10 'control'.
data(iris)
control <- rep(c(rep("normal", 40), rep("control", 10)), 3)
iris$Control <- control
Normal dplyr works fine:
out_df <- iris %>%
group_by(Species) %>%
mutate_each(funs(./median(.[Control == "control"])), 1:4)
Trying to wrap this up into a function:
norm_iris <- function(df, control_col, control_val, species, num_cols = 1:4){
  out <- df %>%
    group_by_(species) %>%
    mutate_each_(funs(./median(.[control_col == control])), num_cols)
  return(out)
}
norm_iris(iris, control_col = "Control", control_val = "control", species = "Species")
I get the error:
Error in UseMethod("as.lazy_dots") :
no applicable method for 'as.lazy_dots' applied to an object of class "c('integer', 'numeric')"
Using funs_ instead of funs I get Error:...: need numeric data
If you haven't already, it might help you to read the vignette on standard evaluation here, although it sounds like some of this may be changing soon.
Your function is missing the use of interp from package lazyeval in the mutate_each_ line. Because you are trying to use a variable name (the Control variable) in the funs, you need funs_ in this situation along with interp. Notice that this is a situation where you don't need mutate_each_ at all. You would need it if you were trying to use column names instead of column numbers when selecting the columns you want to mutate.
Here is what the line would look like in your function instead of what you have:
mutate_each(funs_(interp(~ . / median(.[x == control_val]), x = as.name(control_col))),
            num_cols)
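Putting that together, a hedged sketch of the full corrected function (assuming the same dplyr/lazyeval era as the question, and the Control column added to iris above):
library(dplyr)
library(lazyeval)
norm_iris <- function(df, control_col, control_val, species, num_cols = 1:4){
  df %>%
    group_by_(species) %>%
    mutate_each(funs_(interp(~ . / median(.[x == control_val]),
                             x = as.name(control_col))),
                num_cols)
}
norm_iris(iris, control_col = "Control", control_val = "control", species = "Species")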

get lhs object name when piping with dplyr

I'd like to have a function that can use the pipe operator as exported from dplyr. I am not using magrittr.
df %>% my_function
How can I get df name? If I try
my_function <- function(tbl){print(deparse(substitute(tbl)))}
it returns
[1] "."
while I'd like to have
[1] "df"
Any suggestion?
Thank you in advance,
Nicola
The SO answer that JBGruber links to in the comments mostly solves the problem. It works by moving upwards through execution environments until a certain variable is found, then returns the lhs from that environment. The only thing missing is the requirement that the function outputs both the name of the original data frame and the manipulated data – I gleaned the latter requirement from one of the OP's comments. For that we just need to output a list containing these things, which we can do by modifying MrFlick's answer:
get_orig_name <- function(df){
  i <- 1
  while(!("chain_parts" %in% ls(envir = parent.frame(i))) && i < sys.nframe()) {
    i <- i + 1
  }
  list(name = deparse(parent.frame(i)$lhs), output = df)
}
Now we can tack get_orig_name onto the end of any pipeline to get both the manipulated data and the original data frame's name in a list. We access both using $:
mtcars %>% summarize_all(mean) %>% get_orig_name
#### OUTPUT ####
$name
[1] "mtcars"
$output
mpg cyl disp hp drat wt qsec vs am gear carb
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
I should also mention that, although I think the details of this strategy are interesting, I also think it is needlessly complicated. It sounds like the OP's goal is to manipulate the data and then write it to a file with the same name as the original, unmanipulated, data frame, which can easily be done using more straightforward methods. For example, if we are dealing with multiple data frames we can just do something like the following:
df_list <- list(mtcars = mtcars, iris = iris)
for(name in names(df_list)){
  df_list[[name]] %>%
    group_by_if(is.factor) %>%
    summarise_all(mean) %>%
    write.csv(paste0(name, ".csv"))
}
Here's a hacky way of doing it, which I'm sure breaks in a ton of edge cases:
library(data.table) # for the address function
# or parse .Internal(inspect(x)) output if you feel masochistic
fn = function(tbl) {
  objs = ls(parent.env(environment()))
  objs[sapply(objs,
              function(x) address(get(x, env = parent.env(environment()))) == address(tbl))]
}
df = data.frame(a = 1:10)
df %>% fn
#[1] "df"
Although the question is an old one, and the bounty has already been awarded, I would like to extend on gersht's excellent answer which works perfectly fine for getting the most lefthand-side object name. However, integrating this functionality in a dplyr workflow is not yet solved, apart from using this approach in the very last step of a pipe.
Since I'm using dplyr a lot, I have created a group of custom wrapper functions around the common dplyr verbs which I call metadplyr (I'm still playing around with the functionality, which is why I haven't uploaded it on github yet).
In essence, those functions create a new class called meta_tbl on top of a tibble and write certain things in the attributes of that object. Applied to the problem of the OP I provide a simple example with filter, but the procedure works on any other dplyr verb as well.
In my original function family I use slightly different names than dplyr, but the approach also works when 'overwriting' the original dplyr verbs.
Below is a new filter function which turns a data frame or tibble into a meta_tbl and writes the original name of the lhs object into the attribute .name. Here I am using a short version of gersht's approach.
library(dplyr)
filter <- function(.data, ...) {
  if(!("meta_tbl" %in% class(.data))) {
    .data2 <- as_tibble(.data)
    # add the new class 'meta_tbl' to the data frame
    attr(.data2, "class") <- c(attr(.data2, "class"), "meta_tbl")
    # write the lhs's original name into the attributes
    i <- 1
    while(!("chain_parts" %in% ls(envir = parent.frame(i)))) {
      i <- i + 1
    }
    attr(.data2, ".name") <- deparse(parent.frame(i)$lhs)
  } else {
    # already a meta_tbl: keep it (and its attributes) as-is
    .data2 <- .data
  }
  dplyr::filter(.data2, ...)
}
For convenience it is good to have some helper function to extract the original name from the attributes easily.
.name <- function(.data) {
  if("meta_tbl" %in% class(.data)) {
    attr(.data, ".name")
  } else stop("this function only works on objects of class 'meta_tbl'")
}
Both functions can be used in a workflow in the following way:
mtcars %>%
  filter(gear == 4) %>%
  write.csv(paste0(.name(.), ".csv"))
This might be a bad example, since the pipe doesn't continue, but in theory we could keep piping, carrying the original name along into further function calls.
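For example, a minimal sketch (assuming the redefined filter() and the .name() helper above) where the name is retrieved later from the object rather than at the end of the pipe:
out <- mtcars %>% filter(gear == 4)
.name(out) # "mtcars": the original name travels with the object
out %>% write.csv(paste0(.name(out), ".csv"))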
Inspired by the link mentioned by gersht
You can go back 5 generations to get the name
df %>% {parent.frame(5)$lhs}
An example:
library(dplyr)
a <- 1
df1 <- data.frame(a = 1:10)
df2 <- data.frame(a = 1:10)
a %>% {parent.frame(5)$lhs}
df1 %>% {parent.frame(5)$lhs}
df2 %>% {parent.frame(5)$lhs}
I don't believe this is possible without adding an extra argument to your my_function. When you chain functions with %>%, the left-hand side is evaluated and its result is passed along as the placeholder ., so the original name is no longer visible inside my_function; hence deparse(substitute(tbl)) returns ".".
The following is a very hacky way with dplyr which just adds an addition argument to return the name of the original data.frame
my_function <- function(tbl, orig.df){print(deparse(substitute(orig.df)))}
df %>% my_function(df)
[1] "df"
Note that you couldn't recover the name from the piped argument alone in your original function, because only the evaluated object (bound to .) is passed on to subsequent functions.
