dplyr and overlapping variable names with surrounding environment - r

Let's say I have a (dplyr/tibble) data-frame/tbl constructed like so:
df <- data_frame(x = 1:10)
Now, I'd like to use this within a function that works with df via some dplyr verbs, like so:
myfun <- function(df, x) {
  x <- doSomeStuffTo(x)
  filter(df, x == x)
}
But this will always return the full df... I'm trying to figure out a way to implement scoping within a dplyr verb, something like:
filter_(df, ~x == x)
... which doesn't work, either. In some other languages, you might be able to achieve this via something like:
df.filter(this.x == x)
... where this refers to the df instance.
My only work-around so far is naming the function's variable like so:
myfun <- function(df, query_x) {
  query_x <- doSomeStuffTo(query_x)
  filter(df, x == query_x)
}
I suspect this is doable (without using a name like query_x) somehow with SE dplyr verbs (e.g. filter_), but I haven't stumbled upon the correct pattern yet. Anyone here have the answer?

To dynamically build different dplyr commands, you typically use the standard evaluation versions of the functions (the ones with the underscores) and the lazyeval package. Here's how you could change your function:
doSomeStuffTo <- function(x) {x + 1}
myfun <- function(df, x) {
  x <- doSomeStuffTo(x)
  filter_(df, lazyeval::interp(~x == y, y = x))
}
df <- data_frame(x = 1:10)
myfun(df,3)
But even inside interp() we can't write x == x, because it's not clear which x you want replaced. Both filter(df, 3 == x) and filter(df, x == 3) work with dplyr; you can have constants or column names on either side of the equality.

If you use filter_ you can pass logical expressions via quote:
myfun <- function(df, t) {
  df$x <- 5 * df$x
  filter_(df, t)
}
> myfun(df, t = quote(x < 25))
# A tibble: 4 x 1
      x
  <dbl>
1     5
2    10
3    15
4    20

I stumbled into the same issue. Instead of wrangling with even more complex evaluations, it's usually easier to just rename the function argument. Like this:
myfun <- function(df, x) {
  x_ <- doSomeStuffTo(x)
  filter(df, x == x_)
}
This solution is still risky because df might already contain a column called x_. One can be defensive about this by checking the variable names in df and picking one that isn't there, or, more lazily, by using a very implausible name. I often use something like _____temp.
Maybe the new dplyr 0.6.0 evaluation system will handle this better. See the notes about the new system, tidyeval.
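For reference, the tidyeval system mentioned above (now part of dplyr via the rlang package) does address this directly: inside a verb you can use the .data and .env pronouns, or unquote with !!, to say explicitly which x you mean. A minimal sketch, assuming a recent dplyr and the question's hypothetical doSomeStuffTo():
library(dplyr)
myfun <- function(df, x) {
  x <- doSomeStuffTo(x)
  # .data$x is the column in df; .env$x (or !!x) is the function argument
  filter(df, .data$x == .env$x)
}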

Related

How to call the output of a function in another function?

I have two functions:
getTotalBL <- function(Ne, n){
  ...
  total_branch_length  # output
}
getSNPnumber <- function(total_branch_length, mu, L){
}
where total_branch_length in getSNPnumber is the output of the first function (getTotalBL).
Do I need to do something more than give the argument the same name as the output, or is it correct this way?
You need to store the output of getTotalBL in an object and pass that on as a function argument to getSNPnumber. The scope of total_branch_length is restricted to getTotalBL.
Here are two examples to demonstrate:
Possibility 1:
f1 <- function(x) x^2;
f2 <- function(xsquared, b) xsquared + b;
f2(f1(2), 1)
#[1] 5
which is the same as
ret_from_f1 <- f1(2);
f2(ret_from_f1, 1);
#[1] 5
Possibility 2:
We can also have a function as an argument of another function (here f2):
f2 <- function(fct, x, b) fct(x) + b;
f2(f1, 2, 1)
#[1] 5
If all you're interested in is transferring the results of one function into another, I'd like to suggest the %>% (pipe) operator; it lets you pipe/chain results from one command into another.
It's available in the magrittr package (or dplyr, if you're already using the tidyverse).
Reusing the above 'Possibility 1'
f1 <- function(x) x^2;
f2 <- function(xsquared, b) xsquared + b;
require(dplyr)
f1(2) %>% f2(1)
UPDATE: Why %>% is useful
To my extremely limited knowledge, R stores all objects in RAM. When you create intermediate objects only to remove them later, they still clutter your workspace in the meantime. Using %>% lets you pass results along without naming those intermediates at all.
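A small illustration of the difference, reusing f1 and f2 from above:
# without the pipe, a named intermediate object lingers in the workspace
ret_from_f1 <- f1(2)
f2(ret_from_f1, 1)
#[1] 5

# with the pipe, the result flows into the next call without being named
f1(2) %>% f2(1)
#[1] 5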

R - Creating function call within function using relational operator as variable

I am trying to write a function that will apply a user-specified binary operator (e.g. < ) to a raster object. To do so is fairly simple. For example:
selection <- raster::overlay(x = data, fun = function(x) {return(x < 2)})
My issue is that this code would be running within a function, with which I would like to specify both the binary operator and the criteria value (which is 2 in the example above) as variables. For example:
my.func <- function(data, binary_operator, value){
  selection <- raster::overlay(x = data, fun = function(x) {x binary_operator value})
  return(selection)
}
I have tried to construct the function as a call without success.
my.func <- function(data, binary_operator, value){
  selection <- raster::overlay(x = data, fun = function(x) {call(sprintf("x %s %s", binary_operator, value))})
  return(selection)
}
Is there a way to construct the call of the second function using variables in the first function?
Thanks for your help.
Write your code like this:
my.func <- function(data, binary_operator, value){
  selection <- raster::overlay(x = data, fun = function(x) binary_operator(x, value))
  return(selection)
}
You need to call this as
my.func(data, `<`, 2)
(with backticks for quotes). If you want to allow "<" for the operator, you could use do.call:
my.func <- function(data, binary_operator, value){
  selection <- raster::overlay(x = data, fun = function(x)
    do.call(binary_operator, list(x, value)))
  return(selection)
}
This will work with either form of argument.
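For instance, do.call() accepts either the function itself or its name as a string, which is why both calling styles work; a quick illustration outside of raster::overlay():
do.call(`<`, list(1:5, 3))   # operator passed as a function
do.call("<", list(1:5, 3))   # operator passed by name
# both return TRUE TRUE FALSE FALSE FALSE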
The example is probably simpler than your real case, but with the example you give, it would be more direct to do:
selection <- data < 2

Object not found - nested function - R

I am still getting used to functions. I had a look at the documentation on environments but I can't figure out how to solve the error. Here is what I have tried so far:
I have a list of documents. Let's suppose it is "core":
library(dplyr)
table_1 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
table_2 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
core <- list(table_1, table_2)
Then, I have to run the function documents_ for each element of the list. This function passes some parameters to another, nested function:
documents_ <- function(i) {
  core_processed <- as.data.frame(core[[i]])
  x <- 1:nrow(core_processed)
  y <- 1:ncol(core_processed)
  temp <- sapply(x, function(x) mapply(calc_dens_, x, y))
  return(temp)
}
Inside that, there is the function calc_dens_, which is:
calc_dens_ <- function(x, y) {
  core_temp <- core_processed %>%
    filter(X2 == x & X3 == y)
  return(core_temp)
}
Then, to iterate over each element of the list, I tried the following without success:
calc <- lapply(c(1:2), function(i) documents_(i))
Error in eval(lhs, parent, parent) : object 'core_processed' not found
The calc_dens_ function doesn't get core_processed from documents_ (an environment problem). Is there a way to solve this, or another, better approach? My real function is more complex than this, but the main elements are in this example. Thank you in advance.
As the other commenters have said, the problem is that you are referring to a variable, core_processed that is not in scope. You can make it a global variable, but it might be more sensible just to use it in a closure like this:
table_1 <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))
table_2 <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))
cores <- list(table_1, table_2)
documents_ <- function(core_processed) {
  x <- 1:nrow(core_processed)
  y <- 1:ncol(core_processed)
  calc_dens <- function(x, y) core_processed %>% filter(X2 == x & X3 == y)
  sapply(x, function(x) mapply(calc_dens, x, y))
}
calc <- lapply(cores, documents_)
If cores is a list of data frames, you do not need to use as.data.frame, and since you use lapply there is no need to apply over indices and then index into the list. So the code I wrote here is simplified but does the same as your code.
I have to wonder, though, is this really what you want? The sapply over x followed by an mapply over x and y (where x is the one from the sapply, not the x you built in documents_) looks mighty strange to me.
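If the goal really is to evaluate calc_dens once per (x, y) combination, a more explicit sketch (keeping the closure approach above) would iterate over the full grid of pairs:
documents_grid <- function(core_processed) {
  # all (row index, column index) pairs
  grid <- expand.grid(x = 1:nrow(core_processed), y = 1:ncol(core_processed))
  # one filtered data frame per pair
  Map(function(x, y) filter(core_processed, X2 == x & X3 == y), grid$x, grid$y)
}
calc_grid <- lapply(cores, documents_grid)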

Fixing a function to find and remove outliers from the dataset

I am trying to make a simple function which will find and remove outliers automatically. This is the function I have created so far:
fOutlier <- function(x, y) {
  outlier <- with(x, boxplot.stats(y)$out)
  subset(x, !(y %in% outlier))
}
data <- fOutlier(data, variable)
The problem is that the function does not evaluate y within the dataset x. It only works if I use the following:
data <- fOutlier(data, data$variable)
Non-standard evaluation seems to be the culprit.
This is what I would personally do.
set.seed(1)
# mock data set
d <- data.frame(var1 = rnorm(1000, 500, 50),
                var2 = rnorm(1000, 1000, 100),
                var3 = rnorm(1000, 1000, 100),
                var4 = rnorm(1000, 1000, 100))
fOutlier <- function(dat, var_name){
  var_vec <- dat[, var_name]
  outliers <- boxplot.stats(var_vec)$out
  clean_dat <- dat[!(var_vec %in% outliers), ]
  clean_dat
}
# test with different variables
d_var1_clean<-fOutlier(d, 'var1')
d_var2_clean<-fOutlier(d, 'var2')
d_var3_clean<-fOutlier(d, 'var3')
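As a quick sanity check, the number of rows removed should match the number of values flagged by boxplot.stats() (values falling more than 1.5 times the IQR beyond the box hinges):
length(boxplot.stats(d$var1)$out)   # values flagged as outliers
nrow(d) - nrow(d_var1_clean)        # rows removed by fOutlier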
If you really like the non-standard evaluation, then you can add eval() and substitute() to maintain this functionality.
This function is a workable version of what you posted (note the creation of y_vec):
fOutlier2 <- function(x, y) {
  y_vec <- eval(substitute(y), eval(x))
  outlier <- boxplot.stats(y_vec)$out
  subset(x, !(y_vec %in% outlier))
}
d_var1_clean2<-fOutlier2(d, var1)

character string as function argument r

I'm working with dplyr and created code to compute new data that is plotted with ggplot.
I want to create a function with this code. It should take the name of a column of the data frame that is manipulated by dplyr. However, trying to work with column names does not work. Please consider the minimal example below:
df <- data.frame(A = seq(-5, 5, 1), B = seq(0,10,1))
library(dplyr)
foo <- function (x) {
  df %>%
    filter(x < 1)
}
foo(B)
Error in filter_impl(.data, dots(...), environment()) :
object 'B' not found
Is there any solution to use the name of a column as a function argument?
If you want to create a function which accepts the string "B" as an argument (as in your question's title):
foo_string <- function (x) {
  eval(substitute(df %>% filter(xx < 1), list(xx = as.name(x))))
}
foo_string("B")
If you want to create a function which captures B as an argument (as dplyr does):
foo_nse <- function (x) {
  # capture the argument without evaluating it
  x <- substitute(x)
  eval(substitute(df %>% filter(xx < 1), list(xx = x)))
}
foo_nse(B)
You can find more information in Advanced R
Edit
dplyr makes things easier in version 0.3. Functions with the "_" suffix accept a string or an expression as an argument:
foo_string <- function (x) {
  # construct the string
  string <- paste(x, "< 1")
  # use filter_ instead of filter
  df %>% filter_(string)
}
foo_string("B")
foo_nse <- function (x) {
  # capture the argument without evaluating it
  x <- substitute(x)
  # construct the expression
  expression <- lazyeval::interp(quote(xx < 1), xx = x)
  # use filter_ instead of filter
  df %>% filter_(expression)
}
foo_nse(B)
You can find more information in this vignette
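For completeness: in current dplyr the underscore verbs are deprecated in favour of tidy evaluation. Under a recent dplyr/rlang, the same two patterns could look like this sketch, with the string version using the .data pronoun and the NSE version using the embrace operator {{ }}:
foo_string <- function (x) {
  # x is a column name supplied as a string, e.g. "B"
  df %>% filter(.data[[x]] < 1)
}
foo_string("B")
foo_nse <- function (x) {
  # x is captured unevaluated, as dplyr's own verbs do
  df %>% filter({{ x }} < 1)
}
foo_nse(B)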
I remember a similar question which was answered by @Richard Scriven. I think you need to write something like this:
foo <- function(x,...)filter(x,...)
What @Richard Scriven mentioned was that you need to use ... here. If you look at ?dplyr::filter, you will see filter(.data, ...); you replace .data with x or whatever you call your data frame. If you want to pick out the rows of df whose values in B are smaller than 1, it would look like this:
foo <- function (x,...) filter(x,...)
foo(df, B < 1)
