This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to write an R function that evaluates an expression within a data-frame
I want to write a function that sorts a data.frame -- instead of using the cumbersome order(). Given something like
> x=data.frame(a=c(5,6,7),b=c(3,5,1))
> x
a b
1 5 3
2 6 5
3 7 1
I want to say something like:
sort.df(x,b)
So here's my function:
sort.df <- function(df, ...) {
with(df, df[order(...),])
}
I was really proud of this. Given R's lazy evaluation, I figured that the ... parameter would only be evaluated when needed -- and by that time it would be in scope, due to 'with'.
If I run the 'with' line directly, it works. But the function doesn't.
> with(x,x[order(b),])
a b
3 7 1
1 5 3
2 6 5
> sort.df(x,b)
Error in order(...) : object 'b' not found
What's wrong and how to fix it? I see this sort of "magic" frequently in packages like plyr, for example. What's the trick?
This will do what you want:
sort.df <- function(df, ...) {
dots <- as.list(substitute(list(...)))[-1]
ord <- with(df, do.call(order, dots))
df[ord,]
}
## Try it out
x <- data.frame(a=1:10, b=rep(1:2, length=10), c=rep(1:3, length=10))
sort.df(x, b, c)
And so will this:
sort.df2 <- function(df, ...) {
cl <- substitute(list(...))
cl[[1]] <- as.symbol("order")
df[eval(cl, envir=df),]
}
sort.df2(x, b, c)
It's because when you're passing b you're actually not passing an object. Put a browser inside your function and you'll see what I mean. I stole this from some Internet robot somewhere:
x=data.frame(a=c(5,6,7),b=c(3,5,1))
sort.df <- function(df, ..., drop = TRUE){
ord <- eval(substitute(order(...)), envir = df, enclos = parent.frame())
return(df[ord, , drop = drop])
}
sort.df(x, b)
will work.
So will if you're looking for a nice way to do this in an applied sense:
library(taRifx)
sort(x, f=~b)
Related
I am trying to wrap up a function from an R package. From the source codes, it appears that it has non-standard evaluations for some arguments. How can I write my function to pass the value to the argument that has non-standard evaluation implemented?
Here is a toy example
data <- data.frame(name = 1:10)
#Suppose the one below is the function from that package
toy.fun <- function(dat, var) {
eval(substitute(var), dat)
}
> toy.fun(data, name)
[1] 1 2 3 4 5 6 7 8 9 10
Here is what I try to wrap it
toy.fun2 <- function(dat, var2) {
var_name <- deparse(substitute(var2))
#example, but for similar purpose.
data_subset <- dat[var_name]
toy.fun(data_subset, var2)
}
> toy.fun2(data, name)
Error in eval(substitute(var), dat) : object 'var2' not found
Edit
I should make the question more clear, in which I want to pass different variable name to the function argument in the wrapper for var2. So that when there are different variable names in the data, it could use that name for both data selection, and pass to the function I try to wrap. The source codes for exceedance function from heatwaveR has this ts_y <- eval(substitute(y), data) to capture the input variable y already. This is equivalent to my toy.fun.
I have modified the toy.fun2 for clarity.
Edit 2
It turns out the solution is quite easy, just by substituting the entire function including arguments, and evaluating it.
toy.fun2 <- function(dat, var2) {
var_name <- deparse(substitute(var2))
#example, but for similar purpose.
data_subset <- dat[var_name]
exprs <- substitute(toy.fun(data_subset, var2))
eval(exprs, parent.frame())
#include `envir` argument if this is to be worked in data.table
}
Grab the call using match.call, replace the function name and evaluate it.
It would also be possible to modify the other arguments as needed. Look at the source of lm to see another example of this approach.
toy.fun2 <- function(dat, var) {
cl <- match.call()
cl[[1L]] <- quote(toy.fun)
eval.parent(cl)
}
toy.fun2(data, name)
## [1] 1 2 3 4 5 6 7 8 9 10
Added
The question was revised after this answer was already posted. This addresses the new question. The line setting cl[[2]] could alternately be written as cl[[2]] <- dat[deparse(cl[[3]])] .
toy.fun2 <- function(dat, var2) {
cl <- match.call()
cl[[1]] <- quote(toy.fun)
cl[[2]] <- dat[deparse(substitute(var2))]
names(cl)[3] <- "var"
eval.parent(cl)
}
toy.fun2(data, name)
## [1] 1 2 3 4 5 6 7 8 9 10
I recently discovered that R allows chaining of assignments, e.g.
a = b = 1:10
a
[1] 1 2 3 4 5 6 7 8 9 10
b
[1] 1 2 3 4 5 6 7 8 9 10
I then thought that this could also be used in functions, if two arguments should take the same value. However, this was not the case. For example, plot(x = y = 1:10) produces the following error: Error: unexpected '=' in "plot(x = y =". What is different, and why doesn't this work? I am guessing this has something to with only the first being returned to the function, but both seem to be evaluated.
What are some possibilities and constraints with chained assignments in R?
I don't know about "canonical", but: this is one of the examples that illustrates how assignment (which can be interchangeably be done with <- and =) and passing named arguments (which can only be done using =) are different. It's all about the context in which the expressions x <- y <- 10 or x = y = 10 are evaluated. On their own,
x <- y <- 10
x = y = 10
do exactly the same thing (there are few edge cases where = and <- aren't completely interchangeable as assignment operators, e.g. having to do with operator precedence). Specifically, these are evaluated as (x <- (y <- 10)), or the equivalent with =. y <- 10 assigns the value to 10, and returns the value 10; then x <- 10 is evaluated.
Although it looks similar, this is not the same as the use of = to pass a named argument to a function. As noted by the OP, if f() is a function, f(x = y = 10) is not syntactically correct:
f <- function(x, y) {
x + y
}
f(x = y = 10)
## Error: unexpected '=' in "f(x = y ="
You might be tempted to say "oh, then I can just use arrows instead of equals signs", but this does something different.
f(x <- y <- 10)
## Error in f(x <- y <- 10) : argument "y" is missing, with no default
This statement tries to first evaluate the x <- y <- 10 expression (as above); once it works, it calls f() with the result. If the function you are calling will work with a single, unnamed argument (as plot() does), and you will get a result — although not the result you expect. In this case, since the function has no default value for y, it throws an error.
People do sometimes use <- with a function call as shortcut; in particular I like to use idioms like if (length(x <- ...) > 0) { <do_stuff> } so I don't have to repeat the ... later. For example:
if (length(L <- list(...))>0) {
warning(paste("additional arguments to ranef.merMod ignored:",
paste(names(L),collapse=", ")))
}
Note that the expression length(L <- list(...))>0) could also be written as !length(L <- list(...)) (since the result of length() must be a non-negative integer, and 0 evaluates to FALSE), but I personally think this is a bridge too far in terms of compactness vs readability ... I sometimes think it would be better to forgo the assignment-within-if and write this as L <- list(...); if (length(L)>0) { ... }
PS forcing the association of assignment in the other order leads to some confusing errors, I think due to R's lazy evaluation rules:
rm(x)
rm(y)
## neither x nor y is defined
(x <- y) <- 10
## Error in (x <- y) <- 10 : object 'x' not found
## both x and y are defined
x <- y <- 5
(x <- y) <- 10
## Error in (x <- y) <- 10 : could not find function "(<-"
This is a bug report, not a question. The procedure to report bugs in R core appears complicated, and I don't want to be part of a mailing list. So I'm posting this here (as recommended by https://www.r-project.org/bugs.html.)
Here it is:
The tapply() help of R 4.0.3 says the following on argument X:
an R object for which a split method exists. Typically vector-like, allowing subsetting with [.
Issue: this R object cannot be a data.frame, although a data.frame can be split and subsetted.
To reproduce, run the following:
func <- function(dt) {
sum(dt[,1] * dt[,2])
}
tab <- data.frame(x = sample(100), y = sample(100), z = sample(letters[1:10], 100, T))
tapply(tab[,1:2], INDEX = tab$z, FUN = func)
This results in
error in tapply(tab[, 1:2], INDEX = tab$z, FUN = func) :
arguments must have same length
which, upon looking at the tapply()source code, appears to result from this check:
if (!all(lengths(INDEX) == length(X)))
stop("arguments must have same length")
But length() is not the relevant function to call on a data.frame to determine if it has the right dimension for a split. nrow() should be used instead.
replacing the above code with
if(is.data.frame(X)) {
len <- nrow(X)
} else {
len <- length(X)
}
if (!all(lengths(INDEX) == len))
stop("arguments must have same length")
solves the error.
This fix looks rather straightforward, and implementing it would increase the usefulness of tapply() by a lot (I know there are powerful alternatives to tapply()), so I wonder if the current limitation reflects a design choice.
Based on the function, we could use
library(dplyr)
tab %>%
group_by(z) %>%
summarise(new = func(cur_data()), .groups = 'drop')
-output
# A tibble: 10 x 2
# z new
# <chr> <int>
# 1 a 26647
# 2 b 28010
# 3 c 31340
# 4 d 20780
# 5 e 33311
# 6 f 31880
# 7 g 37527
# 8 h 8752
# 9 i 15490
Or using by from base R
by(tab[, 1:2], tab$z, FUN = func)
According to ?tapply
X - an R object for which a split method exists. Typically vector-like, allowing subsetting with [.
Here, the tab[, 1:2] is a data.frame and not a vector. If it is a matrix, it would be a vector with dim attributes
I followed the discussion over HERE and am curious why is using<<- frowned upon in R. What kind of confusion will it cause?
I also would like some tips on how I can avoid <<-. I use the following quite often. For example:
### Create dummy data frame of 10 x 10 integer matrix.
### Each cell contains a number that is between 1 to 6.
df <- do.call("rbind", lapply(1:10, function(i) sample(1:6, 10, replace = TRUE)))
What I want to achieve is to shift every number down by 1, i.e all the 2s will become 1s, all the 3s will be come 2 etc. Therefore, all n would be come n-1. I achieve this by the following:
df.rescaled <- df
sapply(2:6, function(i) df.rescaled[df.rescaled == i] <<- i-1))
In this instance, how can I avoid <<-? Ideally I would want to be able to pipe the sapply results into another variable along the lines of:
df.rescaled <- sapply(...)
First point
<<- is NOT the operator to assign to global variable. It tries to assign the variable in the nearest parent environment. So, say, this will make confusion:
f <- function() {
a <- 2
g <- function() {
a <<- 3
}
}
then,
> a <- 1
> f()
> a # the global `a` is not affected
[1] 1
Second point
You can do that by using Reduce:
Reduce(function(a, b) {a[a==b] <- a[a==b]-1; a}, 2:6, df)
or apply
apply(df, c(1, 2), function(i) if(i >= 2) {i-1} else {i})
But
simply, this is sufficient:
ifelse(df >= 2, df-1, df)
You can think of <<- as global assignment (approximately, because as kohske points out it assigns to the top environment unless the variable name exists in a more proximal environment). Examples of why this is bad are here:
Examples of the perils of globals in R and Stata
Under what circumstances does the following example return a local x versus a global x?
The xi'an blog wrote the following at http://xianblog.wordpress.com/2010/09/13/simply-start-over-and-build-something-better/
One of the worst problems is scoping. Consider the following little gem.
f =function() {
if (runif(1) > .5)
x = 10
x
}
The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem, even weirder things happen because of interactions between scoping and lazy evaluation.
PS - Is this xi'an blog post written by Ross Ihaka?
Edit - Follow up question.
Is this the remedy?
f = function() {
x = NA
if (runif(1) > .5)
x = 10
x
}
This is only a problem if you write functions that do not take arguments or the functionality relies on the scoping of variables outside the current frame. you either i) pass in objects you need in the function as arguments to that function, or ii) create those objects inside the function that uses them.
Your f is coded incorrectly. If you possibly alter x, then you should pass x in, possibly setting a default of NA or similar if that is what you want the other side of the random flip to be.
f <- function(x = NA) {
if (runif(1) > .5)
x <- 10
x
}
Here we see the function works as per your second function, but by properly assigning x as an argument with appropriate default. Note this works even if we have another x defined in the global workspace:
> set.seed(3)
> replicate(10, f())
[1] NA 10 NA NA 10 10 NA NA 10 10
> x <- 4
> set.seed(3)
> replicate(10, f())
[1] NA 10 NA NA 10 10 NA NA 10 10
Another benefit of this is that you can pass in an x if you want to return some other value instead of NA. If you don't need that facility, then defining x <- NA in the function is sufficient.
The above is predicated on what you actually want to do with f, which isn't clear from your posting and comments. If all you want to do is randomly return 10 or NA, define x <- NA.
Of course, this function is very silly as it can't exploit vectorisation in R - it is very much a scalar operation, which we know is slow in R. A better function might be
f <- function(n = 1, repl = 10) {
out <- rep(NA, n)
out[runif(n) > 0.5] <- repl
out
}
or
f <- function(x, repl = 10) {
n <- length(x)
out <- rep(NA, n)
out[runif(n) > 0.5] <- repl
out
}
Ross's example function was, I surmise, intentionally simple and silly to highlight the scoping issue - it should not be taken as an example of writing good R code, nor would it have been intended as such. Be aware of the scoping feature and code accordingly, and you won't get bitten. You might even find you can exploit this feature...
The 'x' is only declared in the function if the 'if' condition is true, so if 'runif(1)>.5' then the second mentioning of the x will make the function return your local x (10), otherwise it will return a globally defined 'x' (and if 'x' is not defined globally then it will fail)
> f =function() {
+ if (T)
+ x = 10
+ x
+ }
> f()
[1] 10
> f =function() {
+ if (F)
+ x = 10
+ x
+ }
> f()
Error in f() : Object 'x' not found
> x<-77
> f()
[1] 77