data.table .. notation with functions in j (R)

I am trying to use data.table's .. notation with functions. Here is the code I have so far:
library(data.table)
set.seed(42)
dt <- data.table(
  x = rnorm(10),
  y = runif(10)
)
test_func <- function(data, var, var2) {
  vars <- c(var, var2)
  data[, ..vars]
}
test_func(dt, 'x', 'y') # this works
test_func2 <- function(data, var, var2) {
  data[, ..var]
}
test_func2(dt, 'x', 'y') # this works too
test_func3 <- function(data, var, var2) {
  data[, sum(..var)]
}
test_func3(dt, 'x', 'y')
# this does not work
# Error in eval(jsub, SDenv, parent.frame()) : object '..var' not found
It seems data.table does not recognize .. once it is wrapped inside another function call in j. I know I can use sum(get(var)) to achieve the same result, but I want to know whether I am following best practice for most situations.

Parroting an answer to a different problem that works here as well. Not the prettiest solution, but variants on this have worked for me numerous times in the past.
Thanks @Frank for a non-parse() solution here!
I'm well familiar with the old adage "If the answer is parse() you should usually rethink the question," but I often have a hard time coming up with alternatives when evaluating within the data.table calling environment. I'd love to see a robust solution that doesn't execute arbitrary code passed in as a character string; in fact, half the reason I'm posting an answer like this is the hope that someone can recommend a better option.
test_func3 <- function(data, var, var2) {
  expr <- substitute(sum(var), list(var = as.symbol(var)))
  data[, eval(expr)]
}
test_func3(dt, 'x', 'y')
## [1] 5.472968
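For comparison, here is a minimal sketch of the sum(get(var)) approach the question mentions; on the same data and seed it returns the same value:
test_func3_get <- function(data, var, var2) {
  # get(var) resolves the column name at evaluation time inside j,
  # so no .. prefix is needed
  data[, sum(get(var))]
}
test_func3_get(dt, 'x', 'y')
## [1] 5.472968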
Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))
There are far more in depth discussions on the dangers of eval(parse(...)), but I'll avoid repeating them in full.
Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (Do not execute that, it will kill your R session on the spot!). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.
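To see concretely why the substitute()/as.symbol() route above is immune to this, note that parse() turns a string into code, while as.symbol() only ever builds a name to look up. A harmless illustration (the injected string is made up for demonstration):
injected <- "print('this ran as code')"
eval(parse(text = injected))    # the string is parsed and executed as code
try(eval(as.symbol(injected)))  # just a name lookup: errors with "object not found"
                                # unless an object with that exact name exists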
Update:
For the specific case in the comments below, where the table is grouped and sum is then applied to every column, .SDcols is potentially useful. The only way I'm aware of to make sure this function returns consistent results even if dt had a column named var3 is to evaluate the arguments with c() within the function environment, but outside of the data.table environment.
set.seed(42)
dt <- data.table(
  x = rnorm(10),
  y = rnorm(10),
  z = sample(c("a", "b", "c"), size = 10, replace = TRUE)
)
test_func3 <- function(data, var, var2, var3) {
  ListOfColumns <- c(var, var2)
  GroupColumn <- c(var3)
  data[, lapply(.SD, sum), by = eval(GroupColumn), .SDcols = ListOfColumns]
}
test_func3(dt, 'x', 'y','z')
returns
   z         x         y
1: b 1.0531555  2.121852
2: a 0.3631284 -1.388861
3: c 4.0566838 -2.367558

Related

Possible bug with .SD lapply?

Using data.table version 1.8.8. Why does this work:
dat <- data.table(a=1:5,b=5:1)
sdat <- dat[,lapply(.SD,function(x) x*b)]
but this
dat <- data.table(a=1:5,b=5:1)
f <- function(x) x*b
sdat <- dat[,lapply(.SD,f)]
gives
Error in FUN(X[[1L]], ...) : object 'b' not found
Anything I'm missing?
I wouldn't quite call this a bug: when f is called, a and b are each passed to it as a vector named x. (More precisely, the columns of .SD are passed one at a time.)
So while a and b exist within j, the body of your function f is not evaluated within j.
To illustrate, see what happens when you run
with(dat, f(a))
It fails with the same "object 'b' not found" error, because f's body is evaluated in the environment where f was defined, not in the environment with() builds from dat.
I'd recommend just making b an argument of the function to avoid depending on name consistency down the road.
f = function(x,b) x * b
dat[,sapply(.SD, f, b=b)]
You should always pass the variables explicitly if you use lapply:
library(data.table)
dat <- data.table(a=1:5, b=5:1)
f <- function(x, b) x*b
sdat <- dat[,lapply(.SD ,f, b=b)]
That avoids scoping issues.
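With the example data, sdat then holds each column of .SD multiplied by b, i.e. a becomes a*b and b becomes b*b:
sdat
##    a  b
## 1: 5 25
## 2: 8 16
## 3: 9  9
## 4: 8  4
## 5: 5  1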

Use ddply within a function and include variable of interest as an argument

I am relatively new to R, and trying to use ddply & summarise from the plyr package. This post almost, but not quite, answers my question. I could use some additional explanation/clarification.
My problem:
I want to create a simple function to summarize descriptive statistics, by group, for a given variable. Unlike the linked post, I would like to include the variable of interest as an argument to the function. As has already been discussed on this site, this works:
require(plyr)
ddply(mtcars, ~ cyl, summarise,
      mean = mean(hp),
      sd = sd(hp),
      min = min(hp),
      max = max(hp)
)
But this doesn't:
descriptives_by_group <- function(dataset, group, x)
{
  ddply(dataset, ~ group, summarise,
        mean = mean(x),
        sd = sd(x),
        min = min(x),
        max = max(x)
  )
}
descriptives_by_group(mtcars, cyl, hp)
Because of the volume of data with which I am working, I would like to be able to have a function that allows me to specify the variable of interest to me as well as the dataset and grouping variable.
I have tried to edit the various solutions found here to address my problem, but I don't understand the code well enough to do it successfully.
The original poster used the following example dataset:
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")
With the desired output:
b Ave
1 0 1.5
2 1 3.5
And the solution endorsed by Hadley was:
myFunction <- function(x, y){
  NewColName <- "a"
  z <- ddply(x, y, .fun = function(xx, col){
    c(Ave = mean(xx[, col], na.rm = TRUE))},
    NewColName)
  return(z)
}
Where myFunction(df, sv) returns the desired output.
I tried to break down the code piece-by-piece to see if, by getting a better understanding of the underlying mechanics, I could modify the code to include an argument to the function that would pass to what, in this example, is "NewColName" (the variable you want to get information about). But I am not having any success. My difficulty is that I do not understand what is happening with (xx[,col]). I know that mean(xx[,col]) should be taking the mean of the column with index col for the data frame xx. But I don't understand where the anonymous function is reading those values from.
Could someone please help me parse this? I've wasted hours on a trivial task I could accomplish easily with very repetitive code and/or with subsetting, but I got hung up on trying to make my script more simple and elegant, and on understanding the "whys" of this problem and its solution(s).
PS I have looked into the describeBy function from the psych package, but as far as I can tell, it does not let you specify the variable(s) you want to return values for, and consequently does not solve my problem.
I just moved a couple things around in the example function you gave and showed how to get more than one column back out. Does this do what you want?
myFunction2 <- function(x, y, col){
  z <- ddply(x, y, .fun = function(xx){
    c(mean = mean(xx[, col], na.rm = TRUE),
      max = max(xx[, col], na.rm = TRUE))})
  return(z)
}
myFunction2(mtcars, "cyl", "hp")
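For mtcars this returns one row per cyl group; the values match the dplyr output shown further down:
  cyl      mean max
1   4  82.63636 113
2   6 122.28571 175
3   8 209.21429 335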
(More of a comment than an answer. I had the same level of difficulty as you when using ddply(...,summarise, ...) inside a function.) This is a base solution that worked the way I expected:
descriptives_by_group <- function(dataset, group, x)
{
  aggregate(dataset[[x]], dataset[group], function(x)
    c(mean = mean(x),
      sd = sd(x),
      min = min(x),
      max = max(x)
    ))
}
descriptives_by_group(mtcars, 'cyl', 'hp')
Just use the as.quoted() function. Example below (the incomplete fragment is filled out here by forwarding the remaining ddply input through ...):
simple_ddply <- function(dataset_name, variable_name, ...){
  ddply(dataset_name, as.quoted(variable_name), ...)  # remaining input passed through
}
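For instance, reusing df and sv from the example above (summarising column a is hard-coded purely for illustration):
simple_ddply(df, sv, summarise, Ave = mean(a))
  b Ave
1 0 1.5
2 1 3.5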
With the introduction of quosures in the development version of dplyr (soon to be released as 0.6.0), this becomes a bit easier:
library(dplyr)
descriptives_by_groupN <- function(dataset, group, x) {
  group <- enquo(group)
  x <- enquo(x)
  dataset %>%
    group_by(!!group) %>%
    summarise(Mean = mean(!!x),
              SD = sd(!!x),
              Min = min(!!x),
              Max = max(!!x))
}
descriptives_by_groupN(mtcars, cyl, hp)
# A tibble: 3 × 5
#     cyl      Mean       SD   Min   Max
#   <dbl>     <dbl>    <dbl> <dbl> <dbl>
# 1     4  82.63636 20.93453    52   113
# 2     6 122.28571 24.26049   105   175
# 3     8 209.21429 50.97689   150   335
Here, the input arguments are converted to quosures with enquo(), and inside group_by()/summarise() the quosures are unquoted (with !! or UQ()) so that they get evaluated.

Object not found error with ddply inside a function

This has really challenged my ability to debug R code.
I want to use ddply() to apply the same functions to different columns that are sequentially named, e.g. a, b, c. To do this I intend to repeatedly pass the column name as a string and use eval(parse(text=ColName)) to allow the function to reference it. I grabbed this technique from another answer.
And this works well, until I put ddply() inside another function. Here is the sample code:
# Required packages:
library(plyr)
myFunction <- function(x, y){
  NewColName = "a"
  z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
  )
  return(z)
}
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")
#This works.
ColName = "a"
ddply(df, sv, summarize,
      Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
#This doesn't work
#Produces error: "Error in parse(text = NewColName) : object 'NewColName' not found"
myFunction(df,sv)
#Output in both cases should be
# b Ave
#1 0 1.5
#2 1 3.5
Any ideas? NewColName is even defined inside the function!
I thought the answer to this question, loops-to-create-new-variables-in-ddply, might help me, but I've done enough head banging for today and it's time to raise my hand and ask for help.
Today's solution to this question is to change summarize into here(summarize), e.g.
myFunction <- function(x, y){
  NewColName = "a"
  z = ddply(x, y, here(summarize),
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
  )
  return(z)
}
here(f), added to plyr in Dec 2012, captures the current context.
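With that change, the call from the question returns the expected result (assuming a plyr version that includes here(), i.e. 1.8 or later):
myFunction(df, sv)
  b Ave
1 0 1.5
2 1 3.5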
You can do this with a combination of do.call and call to construct the call in an environment where NewColName is still visible:
myFunction <- function(x, y){
  NewColName <- "a"
  z <- do.call("ddply", list(x, y, summarize,
                             Ave = call("mean", as.symbol(NewColName), na.rm = TRUE)))
  return(z)
}
myFunction(df, sv)
b Ave
1 0 1.5
2 1 3.5
I occasionally run into problems like this when combining ddply with summarize or transform, and, not being smart enough to divine the ins and outs of navigating the various environments, I tend to side-step the issue by simply not using summarize and instead using my own anonymous function:
myFunction <- function(x, y){
  NewColName <- "a"
  z <- ddply(x, y, .fun = function(xx, col){
    c(Ave = mean(xx[, col], na.rm = TRUE))},
    NewColName)
  return(z)
}
myFunction(df,sv)
Obviously, there is a cost to doing this stuff 'manually', but it often avoids the headache of dealing with the evaluation issues that come from combining ddply and summarize. That's not to say, of course, that Hadley won't show up with a solution...
The problem lies in the code of the plyr package itself. In the summarize function, there is a line eval(substitute(...), .data, parent.frame()). It is well known that parent.frame() can do pretty funky and unexpected stuff. The solution of @James is a very nice workaround, but if I remember right, @Hadley himself said before that the plyr package was not intended to be used within functions. (Sorry, I was wrong here. It is known, though, that for the moment the plyr package gives problems in these situations.)
Hence, I give you a base solution for the problem:
myFunction <- function(x, y){
  NewColName = "a"
  z = aggregate(x[NewColName], x[y], mean, na.rm=TRUE)
  return(z)
}
> myFunction(df,sv)
b a
1 0 1.5
2 1 3.5
Looks like you have an environment problem. Global assignment fixes the problem, but at the cost of one's soul:
library(plyr)
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
d.f = data.frame(a,b,c)
sv = c("b")
ColName = "a"
ddply(d.f, sv, summarize,
      Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
myFunction <- function(x, y){
  NewColName <<- "a"
  z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
  )
  return(z)
}
myFunction(x=d.f,y=sv)
eval is looking in parent.frame(1). So if you instead define NewColName outside myFunction, it should work:
rm(NewColName)
NewColName <- "a"
myFunction <- function(x, y){
  z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
  )
  return(z)
}
myFunction(x=d.f,y=sv)
By using get to pull out my.parse from the earlier environment, we can come much closer, but still have to pass curenv as a global:
myFunction <- function(x, y){
  NewColName <- "a"
  my.parse <- parse(text=NewColName)
  print(my.parse)
  curenv <<- environment()
  print(curenv)
  z = ddply(x, y, summarize,
            Ave = mean(eval(get("my.parse", envir=curenv)), na.rm=TRUE)
  )
  return(z)
}
> myFunction(x=d.f,y=sv)
expression(a)
<environment: 0x0275a9b4>
b Ave
1 0 1.5
2 1 3.5
I suspect that ddply is evaluating in the .GlobalEnv already, which is why all of the parent.frame() and sys.frame() strategies I tried failed.

Need to access variables from a parent apply within a child apply without globally scoping

Let me try this again, I'm going to leave out the exact data/example and just walk through what I need to accomplish.
I need to apply a function over the rows of a data.frame, that is easy. Then I need to derive some variables within that function using the data.frame that was passed to it. Finally, I'd like to apply a new function over a subset of the data.frame and use the derived variables in the new function.
Can someone please tell me the best practice way to do this rather than globally scoping each of my variables (var1, var2)?
cpt <- a.data.frame
query.db <- function(another.data.frame){
  var1 <- some.values
  var2 <- some.other.values
  apply(cpt[var1,], 1, calc.enrichment) # calc.enrichment needs to access var1, var2!
}
I tried writing calc.enrichment as a user-defined function inside query.db rather than outside of its scope, but my variables (var1, var2) still weren't being recognized. Thanks for any help.
This silly example works for me and seems to address what you are after. We use var1 to index into the columns of the data.frame used in the apply function as you did. var2 is just the standard deviation of the first column of the data.frame passed to it. I'm guessing your real example does something a tad bit more useful.
cpt <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))
another.data.frame <- data.frame(d = rnorm(5), e = rnorm(5), f = rnorm(5))
query.db <- function(dat, outer.dat) {
  var1 <- sample(1:nrow(dat), sample(1:nrow(dat), 1, FALSE), FALSE)
  var2 <- sd(dat[, 1])
  apply(outer.dat[var1, ], 1, function(x) x * sin(var2) / cos(var2)^2)
}
query.db(another.data.frame, cpt)
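Echoing the "pass the variables explicitly" advice from the lapply answer above, another sketch (reusing the question's own placeholder names, so it is not runnable as-is) is to make the derived values explicit arguments of calc.enrichment and forward them through apply's ...:
calc.enrichment <- function(row, var1, var2) {
  # ...compute the enrichment for one row using var1 and var2...
}
query.db <- function(another.data.frame){
  var1 <- some.values
  var2 <- some.other.values
  # the named arguments are forwarded to calc.enrichment via apply's ...
  apply(cpt[var1, ], 1, calc.enrichment, var1 = var1, var2 = var2)
}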

Is it possible to reuse generated columns in ddply?

I have a script where I'm using ddply, as in the following example:
ddply(df, .(col),
      function(x) data.frame(
        col1 = some_function(x$y),
        col2 = some_other_function(x$y)
      )
)
Within ddply, is it possible to reuse col1 without calling the entire function again?
For example:
ddply(df, .(col),
      function(x) data.frame(
        col1 = some_function(x$y),
        col2 = some_other_function(x$y),
        col3 = col1 * col2
      )
)
You've got a whole function to play with! Doesn't have to be a one-liner! This should work:
ddply(df, .(col), function(x) {
  col1 <- some_function(x$y)
  tmp <- some_other_function(x$y)
  data.frame(
    col1 = col1,
    col2 = tmp,
    col3 = col1 * tmp
  )
})
This appears to be a good candidate for data.table using the scoping rules of the j component. See FAQ 2.8 for details.
From the FAQ:
No anonymous function is passed to j. Instead, an anonymous body is passed to j.
So, for your case:
library(data.table)
DT <- as.data.table(df)
DT[, {
  col1 = some_function(y)
  col2 = some_other_function(y)
  col3 = col1 * col2
  list(col1 = col1, col2 = col2, col3 = col3)
}, by = col]
or a slightly more direct way:
DT[, list(
  col1 = col1 <- some_function(y),
  col2 = col2 <- some_other_function(y),
  col3 = col1 * col2
), by = col]
This avoids one repetition each of col1 and col2, and avoids two repeats of col3; repetition is something we strive to reduce in data.table. The = followed by <- might initially look cumbersome. It allows the following syntactic sugar, though:
DT[, list(
  "Projected return (%)" = col1 <- some_function(y),
  "Investment ($m)" = col2 <- some_other_function(y),
  "Return on Investment ($m)" = col1 * col2
), by = col]
where the output can be sent directly to latex or html, for example.
I don't think that's possible, but it shouldn't matter too much, because at that point it's not an aggregation function anymore. For example:
# use summarize() in ddply()
data.means <- ddply(data, .(groups), summarize, mean = mean(x), sd = sd(x), n = length(x))
data.means$se <- data.means$sd / sqrt(data.means$n)
data.means$Upper <- data.means$mean + (data.means$se * 1.96)
data.means$Lower <- data.means$mean - (data.means$se * 1.96)
So I didn't calculate the standard errors directly, but it wasn't so bad calculating them outside of ddply(). If you really wanted to, you could also do
ddply(data, .(groups), summarize, se = sd(x) / sqrt(length(x)))
Or to put it in terms of your example
ddply(df, .(col), summarize,
      col1 = some_function(y),
      col2 = some_other_function(y),
      col3 = some_function(y) * some_other_function(y)
)
