Get lhs object name when piping with dplyr

I'd like to have a function that can use the pipe operator as exported from dplyr. I am not using magrittr.
df %>% my_function
How can I get the name df? If I try
my_function <- function(tbl){print(deparse(substitute(tbl)))}
it returns
[1] "."
while I'd like to have
[1] "df"
Any suggestion?
Thank you in advance,
Nicola

The SO answer that JBGruber links to in the comments mostly solves the problem. It works by moving upwards through execution environments until a certain variable is found, then returns the lhs from that environment. The only thing missing is the requirement that the function outputs both the name of the original data frame and the manipulated data – I gleaned the latter requirement from one of the OP's comments. For that we just need to output a list containing these things, which we can do by modifying MrFlick's answer:
get_orig_name <- function(df){
  i <- 1
  while(!("chain_parts" %in% ls(envir = parent.frame(i))) && i < sys.nframe()) {
    i <- i + 1
  }
  list(name = deparse(parent.frame(i)$lhs), output = df)
}
Now we can add get_orig_name at the end of any pipeline to get the manipulated data and the original data frame's name in a list. We access both using $. (Note that this strategy inspects magrittr's internal evaluation frames for chain_parts, so it may break if those internals change.)
mtcars %>% summarize_all(mean) %>% get_orig_name
#### OUTPUT ####
$name
[1] "mtcars"
$output
mpg cyl disp hp drat wt qsec vs am gear carb
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
I should also mention that, although I think the details of this strategy are interesting, I also think it is needlessly complicated. It sounds like the OP's goal is to manipulate the data and then write it to a file with the same name as the original, unmanipulated, data frame, which can easily be done using more straightforward methods. For example, if we are dealing with multiple data frames we can just do something like the following:
df_list <- list(mtcars = mtcars, iris = iris)
for(name in names(df_list)){
  df_list[[name]] %>%
    group_by_if(is.factor) %>%
    summarise_all(mean) %>%
    write.csv(paste0(name, ".csv"))
}
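If you prefer to avoid the explicit loop, the same idea can be expressed with purrr's iwalk(), which iterates over a list together with its names. A sketch under the same assumptions as the loop above, reusing df_list:
library(purrr)
# .x is each data frame, .y is its name in df_list
iwalk(df_list, ~ .x %>%
        group_by_if(is.factor) %>%
        summarise_all(mean) %>%
        write.csv(paste0(.y, ".csv")))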

Here's a hacky way of doing it, which I'm sure breaks in a ton of edge cases:
library(data.table) # for the address function
# or parse .Internal(inspect(...)) output if you feel masochistic
fn = function(tbl) {
  # compare memory addresses: find which object in the calling
  # environment points to the same data as the piped-in tbl
  objs = ls(parent.env(environment()))
  objs[sapply(objs,
              function(x) address(get(x, envir = parent.env(environment()))) == address(tbl))]
}
df = data.frame(a = 1:10)
df %>% fn
#[1] "df"
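For example (a sketch of one such edge case): any step that copies or modifies the data before fn sees it defeats the address comparison, because the copy no longer shares an address with any named object:
df %>% mutate(b = a * 2) %>% fn
# character(0) -- the mutated copy has a new address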

Although the question is an old one, and the bounty has already been awarded, I would like to extend gersht's excellent answer, which works perfectly well for getting the name of the left-hand-side object. However, integrating this functionality into a dplyr workflow is not yet solved, apart from using the approach in the very last step of a pipe.
Since I'm using dplyr a lot, I have created a group of custom wrapper functions around the common dplyr verbs which I call metadplyr (I'm still playing around with the functionality, which is why I haven't uploaded it to GitHub yet).
In essence, those functions create a new class called meta_tbl on top of a tibble and write certain things into the attributes of that object. Applied to the OP's problem, I provide a simple example with filter, but the procedure works with any other dplyr verb as well.
In my original function family I use slightly different names than dplyr, but the approach also works when 'overwriting' the original dplyr verbs.
Below is a new filter function which turns a data frame or tibble into a meta_tbl and writes the original name of the lhs object into the attribute .name. Here I am using a short version of gersht's approach.
library(dplyr)
filter <- function(.data, ...) {
  if(!("meta_tbl" %in% class(.data))) {
    .data2 <- as_tibble(.data)
    # add new class 'meta_tbl' to the data frame
    attr(.data2, "class") <- c(attr(.data2, "class"), "meta_tbl")
    # write the lhs's original name into the attributes
    i <- 1
    while(!("chain_parts" %in% ls(envir = parent.frame(i)))) {
      i <- i + 1
    }
    attr(.data2, ".name") <- deparse(parent.frame(i)$lhs)
  } else {
    # already a meta_tbl: keep it (and its .name attribute) as is
    .data2 <- .data
  }
  dplyr::filter(.data2, ...)
}
For convenience it is good to have a helper function that extracts the original name from the attributes easily.
.name <- function(.data) {
  if("meta_tbl" %in% class(.data)) {
    attr(.data, ".name")
  } else stop("this function only works on objects of class 'meta_tbl'")
}
Both functions can be used in a workflow in the following way:
mtcars %>%
  filter(gear == 4) %>%
  write.csv(paste0(.name(.), ".csv"))
This might be a bad example, since the pipe doesn't continue, but because the original name travels along in the attributes, we could keep piping and use it in further function calls, as sketched below.
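For instance (an untested sketch building on the functions above), one could branch off the write.csv as a side effect and continue the pipeline:
mtcars %>%
  filter(gear == 4) %>%                             # becomes a meta_tbl
  { write.csv(., paste0(.name(.), ".csv")); . } %>% # side effect; pass the data on
  summarise(mean_mpg = mean(mpg))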

Inspired by the link mentioned by gersht.
You can go back 5 generations to get the name:
df %>% {parent.frame(5)$lhs}
(The magic number 5 depends on magrittr's internal implementation, so it may change between versions.) Example as below:
library(dplyr)
a <- 1
df1 <- data.frame(a = 1:10)
df2 <- data.frame(a = 1:10)
a %>% {parent.frame(5)$lhs}
df1 %>% {parent.frame(5)$lhs}
df2 %>% {parent.frame(5)$lhs}

I don't believe this is possible without adding an extra argument to your my_function. When chaining functions with %>%, the left-hand side is passed along under the placeholder name ".", which is why that is what deparse(substitute()) sees within the pipeline's scope.
The following is a very hacky way with dplyr which just adds an additional argument to return the name of the original data.frame
my_function <- function(tbl, orig.df){print(deparse(substitute(orig.df)))}
df %>% my_function(df)
[1] "df"
Note that you couldn't recover the name with your original function alone, because only the value, bound to ".", is passed on to subsequent functions in the chain.

Related

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions I have been completely overlooking, instead manually creating lists. Anyone have any suggestions?
This can work using the iris dataset as data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
          Sepal.Length                     Sepal.Width
statistic 96.93744                         63.57115
parameter 2                                2
p.value   8.918734e-22                     1.569282e-14
method    "Kruskal-Wallis rank sum test"   "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that just expect a single vector as input, as opposed to functions that have a data argument, in which case you use variable as opposed to data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point in regards to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument to the function. In this example you are passing a formula, in which case the data argument is necessary, whereas the default method just requires x and g arguments (which would probably work with your original approach).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has different dispatch methods that will work for you.
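For example (a minimal sketch using iris for illustration), the default method takes the response vector and the grouping vector directly, so no formula or data argument is needed:
# formula method: kruskal.test(Sepal.Length ~ Species, data = iris)
# default method: pass x and g directly
kruskal.test(x = iris$Sepal.Length, g = iris$Species)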
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning - there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are evaluated within a certain data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], id = data[[id]], na.rm=T), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before doing the sapply. With the dplyr package you can possibly circumvent this if there is a condition to filter on.
data %>%
  group_by(group) %>% # groups data
  summarise_if(is.numeric, mean, na.rm = TRUE) # applies mean to every numeric column
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
  group_by(group) %>% # grouping to retain the column in the next select statement
  select_if(is.numeric) %>% # select all numeric columns
  pivot_longer(cols = -group) %>% # reshape all columns except "group"; names go into `name`, values into `value`
  group_by(name) %>% # regroup on the name variable (the old numeric columns)
  summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) # perform the test per variable
I've only mentioned purrr because you can replace almost all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type; see the sketch below.
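For example (a sketch), the earlier sapply call translates almost one-to-one, with map_dbl then pulling one number per test:
library(purrr)
vars <- c("Sepal.Length", "Sepal.Width")
tests <- map(vars, ~ kruskal.test(iris[[.x]] ~ iris$Species)) # list of htest objects
p_values <- map_dbl(tests, "p.value") # one p-value per variable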
I hope this helps and wish you luck on your coding adventures.

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
  my_data %>%
  group_by(my_grouping_variable) %>%
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
                    unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe chain, but rather check inside your function:
your_function <- function(x) {
  if(inherits(x, "grouped_df")) x <- nest(x)
  # ... then work with the list-column `data` ...
}
Your function should then iterate over the list-column data, which holds all the grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function receives the values for each group directly. It's a bit difficult to explain; just try it out and debug your_function_call step by step...
You can use groups(); however, an SE version of this does not exist, so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
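(Since this answer was written, dplyr gained group_vars(), which returns the grouping variables as a plain character vector and is therefore friendlier for programming. A sketch, reusing df from above:)
group_vars(df)
# [1] "cyl" "mpg"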

Repeatedly mutate variable using dplyr and purrr

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_at.
My custom function
calculate_delta <- function(df, col) {
  # generate variable name
  newcolname = paste("d", col, sep = "")
  # get formula for first difference
  calculate_diff <- lazyeval::interp(~(a + lag(a))/a, a = as.name(col))
  # pass formula to mutate_, name the new variable with the column name generated above
  df %>%
    mutate_(.dots = setNames(list(calculate_diff), newcolname))
}
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although obviously the meaning of the result is nonsensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious if there is a method of repeatedly transforming variables that accomplishes the following things.
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as is the case when using dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes what I've described by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.
Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) / x
cols <- c("wt", "mpg")

# This
library(dplyr)
mtcars %>% mutate_at(cols, delta)

# Or
library(purrr)
mtcars %>% map_at(cols, delta)

# If necessary, in a function
f <- function(df, cols) {
  df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
Edit
If you would like to assign new names afterwards, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
  names(object)[names(object) %in% old] <- new
  object
}
mtcars %>%
  mutate_at(cols, delta) %>%
  Rename(cols, paste0("lagged", cols))
If you want to rename the resulting lagged variables:
mtcars %>% mutate_at(cols, funs(lagged = delta))
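(For readers on current dplyr: since version 1.0.0, across() with its .names argument covers both goals from the question's edit, creating new, automatically named columns while leaving the originals untouched. A sketch, reusing delta and cols from above:)
library(dplyr)
# .names = "d{.col}" creates dwt and dmpg alongside the original columns
mtcars %>% mutate(across(all_of(cols), delta, .names = "d{.col}"))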

R: Further subset a selection using the pipe %>% and placeholder

I recently discovered the pipe operator %>%, which can make code more readable. Here is my MWE.
library(dplyr) # for the pipe operator
library(lsr) # for the cohensD function
set.seed(4) # make it reproducible
dat <- data.frame( # create data frame
  subj = c(1:6),
  pre = sample(1:6, replace = TRUE),
  post = sample(1:6, replace = TRUE)
)
dat %>% select(pre, post) %>% sapply(., mean) # works as expected
However, I struggle using the pipe operator in this particular case
dat %>% select(pre, post) %>% cohensD(.$pre, .$post) # piping returns an error
cohensD(dat$pre, dat$post) # classical way works fine
Why is it not possible to subset columns using the placeholder . in combination with $? Is it worthwhile to write this line using the pipe operator %>%, or does it complicate the syntax? The classical way of writing this seems more concise.
This would work:
dat %>% select(pre, post) %>% {cohensD(.$pre, .$post)}
Wrapping the last call into curly braces makes it be treated like an expression and not a function call. When you pipe something into an expression, the . gets replaced as expected. I often use this trick to call a function which does not interface well with piping.
What is inside the braces happens to be a function call, but it could really be any expression involving . .
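The same trick helps with any function that wants two vectors rather than a data frame, for example (a sketch) a paired t-test:
dat %>% select(pre, post) %>% {t.test(.$pre, .$post, paired = TRUE)}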
Since you're going from a bunch of data into one (row of) value(s), you're summarizing. In a dplyr pipeline you can then use the summarize function; within summarize you don't need to subset and can just refer to pre and post directly.
Like so:
dat %>% select(pre, post) %>% summarize(CD = cohensD(pre, post))
(The select statement isn't actually necessary in this case, but I left it in to show how this works in a pipeline)
It doesn't work because the . placeholder has to be used directly as an argument, not inside a nested function call (like $) in your call.
If you really want to use piping, you can do it with the formula interface, but with a little reshaping beforehand (melt is from the reshape2 package):
dat %>% select(pre, post) %>% melt %>% cohensD(value~variable, .)
#### [1] 0.8115027

Creating a function with an argument passed to dplyr::filter: what is the best way to work around NSE?

Non-standard evaluation is really handy when using dplyr's verbs. But it can be problematic when using those verbs with function arguments.
For example, let us say that I want to create a function that gives me the number of rows for a given species.
# Load packages and prepare data
library(dplyr)
library(lazyeval)
# I prefer lowercase column names
names(iris) <- tolower(names(iris))
# Number of rows for all species
nrow(iris)
# [1] 150
Example not working
This function doesn't work as expected because species is interpreted in the context of the iris data frame instead of being interpreted in the context of the function argument:
nrowspecies0 <- function(dtf, species){
  dtf %>%
    filter(species == species) %>%
    nrow()
}
nrowspecies0(iris, species = "versicolor")
# [1] 150
3 examples of implementation
To work around non-standard evaluation, I usually append the argument with an underscore:
nrowspecies1 <- function(dtf, species_){
  dtf %>%
    filter(species == species_) %>%
    nrow()
}
nrowspecies1(iris, species_ = "versicolor")
# [1] 50
# Because of partial matching of argument names,
# the argument species works too
nrowspecies1(iris, species = "versicolor")
# [1] 50
It is not completely satisfactory, since it changes the name of the function argument to something less user friendly, or it relies on partial argument matching, which I'm afraid is not good practice for programming.
To keep a nice argument name, I could do:
nrowspecies2 <- function(dtf, species){
  species_ <- species
  dtf %>%
    filter(species == species_) %>%
    nrow()
}
nrowspecies2(iris, species = "versicolor")
# [1] 50
Another way to work around non-standard evaluation is based on this answer. interp() interprets species in the context of the function environment:
nrowspecies3 <- function(dtf, species){
  dtf %>%
    filter_(interp(~species == with_species,
                   with_species = species)) %>%
    nrow()
}
nrowspecies3(iris, species = "versicolor")
# [1] 50
Considering the 3 functions above, what is the preferred - most robust - way to implement this filter function? Are there any other ways?
The answer from #eddi is correct about what's going on here.
I'm writing another answer that addresses the larger request of how to write functions using dplyr verbs. You'll note that, ultimately, it uses something like nrowspecies2 to avoid the species == species tautology.
To write a function wrapping dplyr verb(s) that will work with NSE, write two functions:
First write a version that requires quoted inputs, using lazyeval and an SE version of the dplyr verb. So in this case, filter_.
nrowspecies_robust_ <- function(data, species){
  species_ <- lazyeval::as.lazy(species)
  condition <- ~ species == species_      # *
  tmp <- dplyr::filter_(data, condition)  # **
  nrow(tmp)
}
nrowspecies_robust_(iris, ~versicolor)
Second make a version that uses NSE:
nrowspecies_robust <- function(data, species) {
  species <- lazyeval::lazy(species)
  nrowspecies_robust_(data, species)
}
nrowspecies_robust(iris, versicolor)
* = if you want to do something more complex, you may need to use lazyeval::interp here as in the tips linked below
** = also, if you need to change output names, see the .dots argument
For the above, I followed some tips from Hadley
Another good resource is the dplyr vignette on NSE, which illustrates .dots, interp, and other functions from the lazyeval package
For even more details on lazyeval, see its vignette
For a thorough discussion of the base R tools for working with NSE (many of which lazyeval helps you avoid), see the chapter on NSE in Advanced R
This question has absolutely nothing to do with non-standard evaluation. Let me rewrite your initial function to make that clear:
nrowspecies4 <- function(dtf, boo){
  dtf %>%
    filter(boo == boo) %>%
    nrow()
}
nrowspecies4(iris, boo = "versicolor")
#150
The expression inside your filter always evaluates to TRUE (well, almost always - see the example below); that's why it doesn't work, not because of some NSE magic.
Your nrowspecies2 is the way to go.
Fwiw, species in your nrowspecies0 is indeed evaluated as a column, not as the input variable species, and you can check that by comparing nrowspecies0(iris, NA) to nrowspecies4(iris, NA).
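A quick check (a sketch of the expected results):
nrowspecies0(iris, NA) # species == species compares the column with itself
# 150
nrowspecies4(iris, NA) # NA == NA is NA, so filter() drops every row
# 0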
In his 2016 useR! talk (at 38min30s), Hadley Wickham explains the concept of referential transparency. Using a formula, the filter function can be reformulated as:
nrowspecies5 <- function(dtf, formula){
  dtf %>%
    filter_(formula) %>%
    nrow()
}
This has the added benefit of being more generic:
# Make column names lower case
names(iris) = tolower(names(iris))
nrowspecies5(iris, ~ species == "versicolor")
# 50
nrowspecies5(iris, ~ sepal.length > 6 & species == "virginica")
# 41
nrowspecies5(iris, ~ sepal.length > 6 & species == "setosa")
# 0
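(A note for readers on current dplyr: the .data and .env pronouns from rlang now make this disambiguation explicit without any workaround. A sketch, assuming dplyr >= 1.0:)
nrowspecies6 <- function(dtf, species){
  dtf %>%
    # .data$species is the column; .env$species is the function argument
    filter(.data$species == .env$species) %>%
    nrow()
}
nrowspecies6(iris, species = "versicolor")
# 50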
