This has really challenged my ability to debug R code.
I want to use ddply() to apply the same functions to different columns that are sequentially named, e.g. a, b, c. To do this I intend to repeatedly pass the column name as a string and use eval(parse(text=ColName)) to let the function reference it. I grabbed this technique from another answer.
And this works well, until I put ddply() inside another function. Here is the sample code:
# Required packages:
library(plyr)
myFunction <- function(x, y){
NewColName = "a"
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")
#This works.
ColName = "a"
ddply(df, sv, summarize,
Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
#This doesn't work
#Produces error: "Error in parse(text = NewColName) : object 'NewColName' not found"
myFunction(df,sv)
#Output in both cases should be
# b Ave
#1 0 1.5
#2 1 3.5
Any ideas? NewColName is even defined inside the function!
I thought the answer to this question, loops-to-create-new-variables-in-ddply, might help me but I've done enough head banging for today and it's time to raise my hand and ask for help.
Today's solution to this question is to make summarize into here(summarize). e.g.
myFunction <- function(x, y){
NewColName = "a"
z = ddply(x, y, here(summarize),
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
here(f), added to plyr in Dec 2012, captures the current context.
You can do this with a combination of do.call and call to construct the call in an environment where NewColName is still visible:
myFunction <- function(x,y){
NewColName <- "a"
z <- do.call("ddply",list(x, y, summarize, Ave = call("mean",as.symbol(NewColName),na.rm=TRUE)))
return(z)
}
myFunction(d.f,sv)
b Ave
1 0 1.5
2 1 3.5
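To see what the call() expression in this answer builds, it can help to construct it on its own and evaluate it against a data frame. The following is a minimal sketch in base R; col_name, expr and the small df are illustrative stand-ins, not part of the answer above.

```r
# Build the unevaluated call mean(a, na.rm = TRUE) from a column name string.
col_name <- "a"
expr <- call("mean", as.symbol(col_name), na.rm = TRUE)
print(expr)
# mean(a, na.rm = TRUE)

# Evaluating the call with a data frame as the environment resolves `a`
# to the column of that name.
df <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1))
result <- eval(expr, df)
print(result)
# [1] 2.5
```

Because the column name is turned into a symbol before ddply ever sees it, nothing inside summarize needs to look up NewColName any more.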
I occasionally run into problems like this when combining ddply with summarize or transform or something and, not being smart enough to divine the ins and outs of navigating various environments I tend to side-step the issue by simply not using summarize and instead using my own anonymous function:
myFunction <- function(x, y){
NewColName <- "a"
z <- ddply(x, y, .fun = function(xx,col){
c(Ave = mean(xx[,col],na.rm=TRUE))},
NewColName)
return(z)
}
myFunction(df,sv)
Obviously, there is a cost to doing this stuff 'manually', but it often avoids the headache of dealing with the evaluation issues that come from combining ddply and summarize. That's not to say, of course, that Hadley won't show up with a solution...
The problem lies in the code of the plyr package itself. In the summarize function, there is a line eval(substitute(...),.data,parent.frame()). It is well known that parent.frame() can do pretty funky and unexpected stuff.
The solution of @James is a very nice workaround, but if I remember right @Hadley himself said before that the plyr package was not intended to be used within functions.
Sorry, I was wrong here. It is known though that for the moment, the plyr package gives problems in these situations.
Hence, I give you a base solution for the problem :
myFunction <- function(x, y){
NewColName = "a"
z = aggregate(x[NewColName],x[y],mean,na.rm=TRUE)
return(z)
}
> myFunction(df,sv)
b a
1 0 1.5
2 1 3.5
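Since the original goal was to apply the same function to several sequentially named columns, the aggregate() approach can probably be extended by passing a vector of column names. This is a sketch under that assumption; summarise_cols is a hypothetical helper name, and the data frame repeats the one from the question.

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1), c = c(5, 6, 7, 8))

# Hypothetical helper: average every column named in `cols`,
# grouped by the column(s) named in `by`.
summarise_cols <- function(x, by, cols) {
  aggregate(x[cols], x[by], mean, na.rm = TRUE)
}

res <- summarise_cols(df, by = "b", cols = c("a", "c"))
print(res)
#   b   a   c
# 1 0 1.5 5.5
# 2 1 3.5 7.5
```

All names are passed as plain strings, so no non-standard evaluation is involved and the function behaves the same at top level and inside other functions.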
Looks like you have an environment problem. Global assignment fixes the problem, but at the cost of one's soul:
library(plyr)
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
d.f = data.frame(a,b,c)
sv = c("b")
ColName = "a"
ddply(d.f, sv, summarize,
Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
myFunction <- function(x, y){
NewColName <<- "a"
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
myFunction(x=d.f,y=sv)
eval is looking in parent.frame(1). So if you instead define NewColName outside myFunction it should work:
rm(NewColName)
NewColName <- "a"
myFunction <- function(x, y){
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
myFunction(x=d.f,y=sv)
By using get to pull out my.parse from the earlier environment, we can come much closer, but still have to pass curenv as a global:
myFunction <- function(x, y){
NewColName <- "a"
my.parse <- parse(text=NewColName)
print(my.parse)
curenv <<- environment()
print(curenv)
z = ddply(x, y, summarize,
Ave = mean( eval( get("my.parse" , envir=curenv ) ), na.rm=TRUE)
)
return(z)
}
> myFunction(x=d.f,y=sv)
expression(a)
<environment: 0x0275a9b4>
b Ave
1 0 1.5
2 1 3.5
I suspect that ddply is evaluating in the .GlobalEnv already, which is why all of the parent.frame() and sys.frame() strategies I tried failed.
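The lookup failure can be reproduced without plyr, which may make the mechanism easier to see. The sketch below is not plyr's code: toy_summarize and toy_ddply are hypothetical stand-ins that only mimic the eval(substitute(...), .data, parent.frame()) pattern quoted earlier, and the example assumes no global object called local_col exists.

```r
# Hypothetical stand-in for summarize(): evaluate the arguments in the data,
# falling back to the frame that called this function.
toy_summarize <- function(.data, ...) {
  as.data.frame(eval(substitute(list(...)), .data, parent.frame()))
}

# Hypothetical stand-in for ddply(): it calls .fun itself, so inside .fun
# parent.frame() is this function's frame, not the user's function.
toy_ddply <- function(.data, .fun, ...) {
  .fun(.data, ...)
}

d <- data.frame(a = c(1, 2, 3, 4))

# Calling toy_summarize directly: parent.frame() is direct()'s frame,
# so local_col is found.
direct <- function(x) {
  local_col <- "a"
  toy_summarize(x, Ave = mean(eval(parse(text = local_col))))
}

# Going through toy_ddply: the lookup starts in the data, continues in
# toy_ddply's frame and then the global environment; nested()'s frame is skipped.
nested <- function(x) {
  local_col <- "a"
  toy_ddply(x, toy_summarize, Ave = mean(eval(parse(text = local_col))))
}

direct_result <- direct(d)
nested_result <- tryCatch(nested(d), error = function(e) conditionMessage(e))
print(direct_result)   # one row, Ave = 2.5
print(nested_result)   # an "object ... not found" message
```

This is consistent with the observation that defining the variable globally makes the original code work: the global environment is on the lookup path, the calling function's frame is not.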
What I wish to achieve
I want to get the names of the functions stored in a list of functions.
Here is an example:
foo = list(foo1 = sum, foo2 = mean)
What I wish to extract from foo is:
list("sum", "mean")
And I would like it to be a function, meaning:
> foo = list(foo1 = sum, foo2 = mean)
> super_function(foo)
list("sum", "mean")
What I have checked
Applying names:
> sapply(foo , names)
$`foo1`
NULL
$foo2
NULL
Applying deparse(substitute())
> my_f <- function(x)deparse(substitute(x))
> sapply(foo, my_f)
foo1 foo2
"X[[i]]" "X[[i]]"
Neither idea works....
More background:
Here are some more details. You don't need them to understand the question above; they are extra details requested by the community.
I'm using those functions as aggregation functions given by the user.
data(iris)
agg_function<-function(data, fun_to_apply){
res <- list()
for (col_to_transform in names(fun_to_apply)){
res[col_to_transform] <- (fun_to_apply[[col_to_transform]])(data[[col_to_transform]])
}
res
}
agg_function(iris, fun_to_apply = list("Sepal.Length" = mean, "Petal.Length" = sum))
Result is:
$`Sepal.Length`
[1] 5.843333
$Petal.Length
[1] 563.7
In this example I'm performing aggregation on two columns of iris. But I wish to have the name of the performed function in the name of each field of my result.
NB: This is an oversimplification of what I'm doing.
Conclusion:
Do you have any ideas?
If you are starting from just the list foo = list(foo1 = sum, foo2 = mean), then it's not possible. The call to list() evaluates its arguments and stores the function objects that the variables sum and mean point to; it does not remember those variable names. Functions in R don't carry names: a function can be assigned to a variable, but it can also exist without any name at all.
You've basically just created a named list of functions. That might also look like this:
foo = list(foo1 = function(x) sum(x+1),
foo2 = function(x) mean(x+1))
Here we also have functions, but these functions don't have "names" other than the names you gave to them in the list.
The only chance you have of making this work is to use something other than list() when creating foo in the first place, or to have users explicitly call list() in the function call (which isn't very practical).
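As one possible reading of "something other than list()", a small constructor could capture the expressions the user typed before they are evaluated. This is only a sketch: named_funs is a hypothetical helper, and it overwrites any names such as foo1 with the deparsed expressions.

```r
# Hypothetical helper: like list(), but records the expression used for each
# argument as the element name.
named_funs <- function(...) {
  # Unevaluated argument expressions, e.g. the symbols `sum` and `mean`
  exprs <- as.list(substitute(list(...)))[-1]
  funs <- list(...)
  names(funs) <- vapply(
    exprs,
    function(e) paste(deparse(e), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
  funs
}

foo <- named_funs(sum, mean)
fun_names <- as.list(names(foo))
str(fun_names)
# List of 2
#  $ : chr "sum"
#  $ : chr "mean"
```

The functions themselves are still usable, for example foo$mean(1:10), but the names only exist because they were captured at construction time.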
Although you already said that the tidyverse is not suitable for you, I will add this as another idea.
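library(dplyr)  # provides %>%, summarise_at() and funs()
# Note: funs() is deprecated as of dplyr 0.8.0.
agg_function <- function(df, x, ...){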
df %>%
summarise_at(.vars = x, funs(...))
}
agg_function(iris, c("Sepal.Length", "Petal.Length"), mean, sum)
Sepal.Length_mean Petal.Length_mean Sepal.Length_sum Petal.Length_sum
1 5.843333 3.758 876.5 563.7
You can use a list with the functions as strings
foo <- list(foo1 = "mean", foo2 = "sum")
foo
$foo1
[1] "mean"
$foo2
[1] "sum"
get(foo[[1]])(1:10)
[1] 5.5
get(foo[[2]])(1:10)
[1] 55
Or use the rlang package and do something like
library(rlang)
foo <- quos(foo1 = mean, foo2 = sum)
getNames <- function(x) {
sapply(x, function(x) x[[2]])
}
getNames(foo)
$foo1
mean
$foo2
sum
eval_tidy(foo[[1]])(1:10)
[1] 5.5
eval_tidy(foo[[2]])(1:10)
[1] 55
This also works with anonymous functions
foo <- quos(foo1 = function(x) sum(x + 1), foo2 = sum)
getNames(foo)
$foo1
function(x) sum(x + 1)
$foo2
sum
eval_tidy(foo[[1]])(1:10)
[1] 65
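Building on the functions-as-strings idea in this answer, the aggregation function from the question could put the function name into each result name. This is a sketch, not a tested drop-in: the "column_function" naming scheme is an assumption, and it relies on the built-in iris data set.

```r
# Variant of agg_function that takes function names as strings and uses them
# to label the results. The "<column>_<function>" naming is an assumption.
agg_function <- function(data, fun_to_apply) {
  res <- list()
  for (col in names(fun_to_apply)) {
    fname <- fun_to_apply[[col]]
    # match.fun() looks up the function object from its name
    res[[paste(col, fname, sep = "_")]] <- match.fun(fname)(data[[col]])
  }
  res
}

out <- agg_function(iris, list(Sepal.Length = "mean", Petal.Length = "sum"))
print(names(out))
# [1] "Sepal.Length_mean" "Petal.Length_sum"
```

The trade-off is that anonymous functions can no longer be passed directly; they would need to be assigned to a name first.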
I'm not very familiar with how R functions handle the variables passed to them.
Here's the problem:
I want to build a function whose ... arguments are column names of the data frame, to be used in table().
f <- function (data, ...){
T <- with(data, table(...)) # ... variables input
return(T)
}
How can I deal with the code?
Thanks a lot for answering!
The order of evaluation apparently doesn't quite work right with with(). Here's an alternative that should work (using sample data from @DavidArenburg):
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
xx <- lapply(substitute(...()), eval, data, parent.frame())
T <- do.call(table, xx)
return(T)
}
f(data = data1, a,b)
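The same idea can be written with match.call(), which some readers may find easier to follow than substitute(...()). The following is a sketch with a hypothetical function f2 and a tiny fixed data frame d so the result is easy to check by hand.

```r
f2 <- function(data, ...) {
  # Frame of whoever called f2, used as a fallback for names not in `data`
  caller <- parent.frame()
  # Unevaluated expressions supplied through `...`
  dots <- match.call(expand.dots = FALSE)$...
  # Evaluate each expression with the data frame's columns in scope
  cols <- lapply(dots, eval, data, caller)
  do.call(table, cols)
}

d <- data.frame(a = c(1, 1, 2), b = c("x", "y", "y"))
tab <- f2(d, a, b)
print(tab)
```

Here a takes the values 1 and 2 and b the values "x" and "y", so the table has one observation in each of the cells (1, x), (1, y) and (2, y), and none in (2, x).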
It is often far easier to avoid non-standard evaluation and use character strings to reference the columns within a data.frame.
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
do.call(table,data[unlist(list(...))])
}
# the following calls to `f` return the same results
f(data = data1, 'a','b')
f(data = data1, c('a','b'))
a <- c('a','b')
f(data = data1, a)
Please note that I already had a look at this and that but still cannot solve my problem.
Suppose a minimal working example:
a <- c(1,2,3)
b <- c(2,3,4)
c <- c(4,5,6)
dftest <- data.frame(a,b,c)
foo <- function(x, y, data = data) {
data[, c("x","y")]
}
foo(a, b, data = dftest)
Here, the last line obviously returns an Error: undefined columns selected. This error is returned because the columns to be selected are x and y, which are not part of the data frame dftest.
Question: How do I need to formulate the definition of the function to obtain the desired output, which is
> dftest[, c("a","b")]
# a b
# 1 1 2
# 2 2 3
# 3 3 4
which I want to obtain by calling the function foo.
Please be aware that in order for the solution to be useful for my purposes, the format of the function call of foo is to be regarded fixed, that is, the only changes are to be made to the function itself, not the call. I.e. foo(a, b, data = dftest) is the only input to be allowed.
Approach: I tried to use paste and substitute in combination with eval to first replace the x and y with the arguments of the function call and then evaluate the call. However, escaping the quotation marks seems to be a problem here:
foo <- function(x, y, data = data) {
substitute(data[, paste("c(\"",x,"\",\"",y,"\")", sep = "")])
}
foo(a, b, data = dftest)
eval(foo(a, b, data = dftest))
Here, foo(a, b, data = dftest) returns:
dftest[, paste("c(\"", a, "\",\"", b, "\")", sep = "")]
However, when evaluating with eval() (focusing only on the paste part),
paste("c(\"", a, "\",\"", b, "\")", sep = "")
returns:
# "c(\"1\",\"2\")" "c(\"2\",\"3\")" "c(\"3\",\"4\")"
and not, as I would hope c("a","b"), thus again resulting in the same error as above.
Try this:
foo <- function(x, y, data = data) {
x <- deparse(substitute(x))
y <- deparse(substitute(y))
data[, c(x, y)]
}
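A quick check of this approach against the data from the question might look like the following. It is a sketch: the local names x_name and y_name are illustrative, and the data = data default from the question is dropped here because a self-referential default cannot be evaluated if the argument is ever left out.

```r
dftest <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

foo <- function(x, y, data) {
  # substitute() returns the expression the caller typed for the argument;
  # deparse() turns that expression into a string such as "a".
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  data[, c(x_name, y_name)]
}

res <- foo(a, b, data = dftest)
print(res)
#   a b
# 1 1 2
# 2 2 3
# 3 3 4
```

The arguments x and y are never evaluated, so it does not matter whether objects named a and b exist outside the data frame.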
Using data.table version 1.8.8. Why does this work:
dat <- data.table(a=1:5,b=5:1)
sdat <- dat[,lapply(.SD,function(x) x*b)]
but this
dat <- data.table(a=1:5,b=5:1)
f <- function(x) x*b
sdat <- dat[,lapply(.SD,f)]
gives
Error in FUN(X[[1L]], ...) : object 'b' not found
Anything I'm missing?
I wouldn't quite call this a bug. When you call f, a and b are each passed to it as a vector called x. (More precisely, the columns of .SD are passed one at a time.)
So while a and b exist within j, the body of your function f is not evaluated within j.
To illustrate, see what happens when you run
with(dat, f(a))
I'd recommend just making b an argument of the function to avoid depending on name consistency down the road.
f = function(x,b) x * b
dat[,sapply(.SD, f, b=b)]
You should always pass the variables explicitly if you use lapply:
library(data.table)
dat <- data.table(a=1:5, b=5:1)
f <- function(x, b) x*b
sdat <- dat[,lapply(.SD ,f, b=b)]
That avoids scoping issues.
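The scoping rule behind both answers can be illustrated with a plain data.frame and with(), which avoids the data.table dependency. This is a sketch that assumes no global object named b exists; f_free and f_arg are hypothetical names.

```r
dat <- data.frame(a = 1:5, b = 5:1)

# b is a free variable: it is looked up where f_free was defined
# (the global environment), not where f_free is called.
f_free <- function(x) x * b

# b is an explicit argument, so the caller decides what it is.
f_arg <- function(x, b) x * b

# Inside with(), the columns a and b are visible to the expression itself,
# but not to the body of f_free.
msg <- tryCatch(with(dat, f_free(a)), error = function(e) conditionMessage(e))
print(msg)  # an "object 'b' not found" message, given the assumption above

ok <- with(dat, f_arg(a, b = b))
print(ok)
# [1] 5 8 9 8 5
```

An anonymous function written inline behaves differently because it is defined inside the expression being evaluated, so its enclosing environment already contains the columns.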
I have a function that I use to get a "quick look" at a data.frame... I deal with a lot of survey data and this acts as a quick tool to see what's what.
f.table <- function(x) {
if (is.factor(x[[1]])) {
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
x <- na.omit(melt(x,c()))
x <- cast(x, variable ~ value, frequency)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
frequency <- function(x) {
x[x > 1] <- 1
x[is.na(x)] <- 0
x <- round(sum(x)/n, digits=2)
}
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(frequency, mean, sd, min, max))
x <- transform(x, variable=reorder(variable, frequency))
}
return(x)
}
What I find happens is that if I don't define "frequency" outside of the function, it returns wonky results for data frames with continuous variables. It doesn't seem to matter which definition I use outside of the function, so long as I do.
try:
n <- 100
x <- data.frame(a=c(1:25),b=rnorm(100),c=rnorm(100))
x[x > 20] <- NA
Now, select either one of the frequency functions and paste them in and try it again:
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
f.table(x)
Why is that?
Crucially, I think this is where your problem is. cast() is evaluating those functions without reference to the function it was called from. Inside cast() it evaluates fun.aggregate via funstofun and, although I don't really follow what it is doing, it is getting stats::frequency and not your local one.
Hence my comment to your Q. What do you want the function to do? At the moment it would seem necessary to define a "frequency" function in the global environment so that cast() or funstofun() finds it. Give it a unique name so that it is unlikely to clash with anything and is the only thing found, say .Frequency(). Without knowing what you want to do with the function (rather than what you thought the function [f.table] should do) it is a bit difficult to provide further guidance, but why not have .FrequencyNum() and .FrequencyFac() defined in the global workspace and rewrite the cast() calls in your f.table() wrapper to use the relevant one?
.FrequencyFac <- function(X, N) {
round(length(X)/N, digits=2)
}
.FrequencyNum <- function(X, N) {
X[X > 1] <- 1
X[is.na(X)] <- 0
round(sum(X)/N, digits=2)
}
f.table <- function(x, N) {
if (is.factor(x[[1]])) {
x <- na.omit(melt(x, c()))
x <- dcast(x, variable ~ value, .FrequencyFac, N = N)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(.FrequencyNum, mean, sd, min, max), N = N)
##x <- transform(x, variable=reorder(variable, frequency))
## left this out as I wanted to see what cast returned
}
return(x)
}
Which I thought would work, but it is not finding N, and it should be. So perhaps I am missing something here?
By the way, it is probably not a good idea to rely on functions that find n (in your version) from outside the function. Always pass in the variables you need as arguments.
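One plausible way a locally defined frequency can be missed is if the function is resolved by name from inside another function. The sketch below is not reshape's code: by_name and by_value are hypothetical helpers that only illustrate the lookup rule, and the example assumes that the stats package is attached and that no global object named frequency exists.

```r
# Resolves a function from its name; the search starts in by_name's own
# frame, then the global environment, then attached packages.
by_name <- function(fname, x) {
  fn <- get(fname, mode = "function")
  fn(x)
}

# Receives the function object itself, so no lookup is needed.
by_value <- function(fn, x) fn(x)

wrapper <- function(x) {
  frequency <- function(x) "local frequency"
  list(
    by_name  = by_name("frequency", x),   # finds stats::frequency
    by_value = by_value(frequency, x)     # uses the local definition
  )
}

res <- wrapper(1:3)
str(res)
# List of 2
#  $ by_name : num 1
#  $ by_value: chr "local frequency"
```

wrapper's frame is never on by_name's search path, which matches the symptom that a global definition is found while a local one is not.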
I don't have the package that contains melt, but there are a couple potential issues I can see:
Your frequency functions end in an assignment, so they return their value only invisibly rather than explicitly.
It's generally bad practice to alter function inputs (x is the input and the output).
There is already a generic frequency function in stats package in base R, which may cause issues with method dispatch (I'm not sure).
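On the first point, it may help to see how R treats a function whose last expression is an assignment. This is a small sketch; frequency_assign and frequency_plain are hypothetical names, and n is passed as an argument rather than taken from the global environment.

```r
# Last expression is an assignment: the value is returned invisibly.
frequency_assign <- function(x, n) {
  x <- round(length(x) / n, digits = 2)
}

# Last expression is the value itself: returned visibly.
frequency_plain <- function(x, n) {
  round(length(x) / n, digits = 2)
}

val <- frequency_assign(1:25, n = 100)
print(val)
# [1] 0.25

vis_assign <- withVisible(frequency_assign(1:25, n = 100))$visible
vis_plain  <- withVisible(frequency_plain(1:25, n = 100))$visible
print(c(vis_assign, vis_plain))
# [1] FALSE  TRUE
```

So the value is available to a caller such as cast() either way; the difference only shows up when the function is called at the prompt, where the first version prints nothing.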