Possible bug with .SD lapply?

Using data.table version 1.8.8. Why does this work:
dat <- data.table(a=1:5,b=5:1)
sdat <- dat[,lapply(.SD,function(x) x*b)]
but this
dat <- data.table(a=1:5,b=5:1)
f <- function(x) x*b
sdat <- dat[,lapply(.SD,f)]
gives
Error in FUN(X[[1L]], ...) : object 'b' not found
Anything I'm missing?

I wouldn't quite call this a bug. When you call f, a and b are each passed to it as a vector called x. (More precisely, the columns of .SD are passed one at a time.)
So while a and b exist within j, the body of your function f is not evaluated within j.
To illustrate, see what happens when you run
with(dat, f(a))
I'd recommend just making b an argument of the function to avoid depending on name consistency down the road.
f = function(x,b) x * b
dat[,sapply(.SD, f, b=b)]
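The scoping rule behind this can be reproduced without data.table. The following is a minimal sketch in base R, where with() stands in for j (an assumption made only to keep the example dependency-free) and where no global variable named b is assumed to exist. The names outside, inline, explicit and g are made up for the illustration.

```r
# Base R stand-in for the data.table example; with() plays the role of j.
dat <- data.frame(a = 1:5, b = 5:1)

# f is created at top level, so its free variable `b` is looked up in the
# environment where f was defined, not in the data.
f <- function(x) x * b
outside <- tryCatch(with(dat, f(a)), error = function(e) conditionMessage(e))

# An anonymous function created inside with() is enclosed by the data
# environment, so it can see column `b`.
inline <- with(dat, (function(x) x * b)(a))

# Passing b explicitly removes the dependence on scoping altogether.
g <- function(x, b) x * b
explicit <- with(dat, g(a, b = b))
```

Here outside holds the caught error message, while inline and explicit both hold the element-wise product of the two columns.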

You should always pass the variables explicitly if you use lapply:
library(data.table)
dat <- data.table(a=1:5, b=5:1)
f <- function(x, b) x*b
sdat <- dat[,lapply(.SD ,f, b=b)]
That avoids scoping issues.

Related

R function with formula return has large memory imprint

I have a function that fits a model which I call many times with the same big matrix (creating different formula inside each time). However, it seems that R saves copies of the data I use along the way, and so my memory explodes.
A simple deletion inside the function avoids this problem. However, is there a general way to avoid retaining the whole environment each time?
For example, running the following,
test <- function(X, y, rm.env=F){
df <- cbind(y, X)
names(df) <- c("label", paste0("X", as.character(1:ncol(X))))
f <- formula(label~1, data=df, env=emptyenv())
if (rm.env){
rm(list=c("df", "X", "y"))
}
print(pryr::object_size(f))
return(f)
}
X <- matrix(rnorm(700*10000), ncol=700)
y <- rnorm(10000)
m <- test(X, y)
print(pryr::object_size(m))
m <- test(X, y, rm.env=T)
print(pryr::object_size(m))
results in,
672 B
168 MB
672 B
1.13 kB
Note that the object in the first call has 168 MB behind it, so calling the first version over and over again eats a lot of memory fast.
formula(label~1, data=df, env=emptyenv()) calls the S3 method formula.formula. Let’s have a look at its code:
stats:::formula.formula
# function (x, ...)
# x
… the extra arguments are ignored!
In other words, your assignment is the same as if you had written simply f = label ~ 1. In particular, its associated environment is the local environment, not the empty environment. To fix this, you need to manually reset it:
test <- function (X, y) {
df <- cbind(y, X)
colnames(df) <- c("label", paste0("X", seq_len(ncol(X))))  # one name per column of X
# TODO: do something with `df` …
f <- label ~ 1
environment(f) <- emptyenv()
f
}
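To see the effect without a large matrix, here is a small sketch in base R. The function make_formula and the variable big are hypothetical names used only for this illustration; the sketch relies on formula.formula ignoring its extra arguments, as shown above.

```r
make_formula <- function(reset_env = FALSE) {
  big <- numeric(1e6)                        # stands in for the large data
  f <- formula(label ~ 1, env = emptyenv())  # `env` is silently ignored
  if (reset_env) environment(f) <- emptyenv()
  f
}

f1 <- make_formula()
f2 <- make_formula(reset_env = TRUE)

# f1 still points at the function's frame, which keeps `big` alive.
keeps_big <- exists("big", envir = environment(f1), inherits = FALSE)

# f2 carries the empty environment, so nothing from the frame is retained.
is_empty <- identical(environment(f2), emptyenv())
```

One caveat worth keeping in mind: a formula whose environment is the empty environment cannot look up anything that is not supplied through a data argument, so whether this reset is appropriate depends on how the formula is used later.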

data.table .. notation with functions in j

I am trying to use data.table's .. notation with functions, here is the code I have so far:
set.seed(42)
dt <- data.table(
x = rnorm(10),
y = runif(10)
)
test_func <- function(data, var, var2) {
vars <- c(var, var2)
data[, ..vars]
}
test_func(dt, 'x', 'y') # this works
test_func2 <- function(data, var, var2) {
data[, ..var]
}
test_func2(dt, 'x', 'y') # this works too
test_func3 <- function(data, var, var2) {
data[, sum(..var)]
}
test_func3(dt, 'x', 'y')
# this does not work
# Error in eval(jsub, SDenv, parent.frame()) : object '..var' not found
It seems data.table does not recognize .. once it's wrapped inside another function in j. I know I can use sum(get(var)) to achieve the result, but I want to know whether I am using the best practice for most situations.
Parroting an answer to a different problem that works here as well. Not the prettiest solution, but variants on this have worked for me numerous times in the past.
Thanks @Frank for a non-parse() solution here!
I'm well familiar with the old adage "If the answer is parse() you should usually rethink the question", but I often have a hard time coming up with alternatives when evaluating within the data.table calling environment. I'd love to see a robust solution that doesn't execute arbitrary code passed in as a character string. In fact, half the reason I'm posting an answer like this is in hopes that someone can recommend a better option.
test_func3 <- function(data, var, var2) {
expr = substitute(sum(var), list(var=as.symbol(var)))
data[, eval(expr)]
}
test_func3(dt, 'x', 'y')
## [1] 5.472968
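The same substitute() plus as.symbol() pattern can be tried in base R, which may help to see what expression is being built. This is only a sketch on a plain data.frame; sum_col, build_expr and the toy data are made up for the illustration.

```r
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Turn a column name given as a string into the call sum(<column>).
build_expr <- function(var) {
  substitute(sum(v), list(v = as.symbol(var)))
}

sum_col <- function(data, var) {
  # Evaluate the constructed call with the data frame as the lookup scope.
  eval(build_expr(var), data)
}

expr_x <- build_expr("x")   # the unevaluated call sum(x)
total_x <- sum_col(toy, "x")
total_y <- sum_col(toy, "y")
```

Because the string is converted to a symbol rather than parsed, an odd column name is treated as a name to look up and is never run as code.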
Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))
There are far more in depth discussions on the dangers of eval(parse(...)), but I'll avoid repeating them in full.
Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (Do not execute that, it will kill your R session on the spot!). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.
Update:
For the specific case in the comments below where the table is grouped and then sum is applied to all, .SDcols is potentially useful. The only way I'm aware of to make sure that this function would return consistent results even if dt had a column named var3 is to evaluate the arguments within the function environment but outside of the data.table environment using c().
set.seed(42)
dt <- data.table(
x = rnorm(10),
y = rnorm(10),
z = sample(c("a","b","c"),size = 10, replace = TRUE)
)
test_func3 <- function(data, var, var2, var3) {
ListOfColumns = c(var,var2)
GroupColumn <- c(var3)
data[, lapply(.SD, sum), by = eval(GroupColumn), .SDcols = ListOfColumns]
}
test_func3(dt, 'x', 'y','z')
returns
z x y
1: b 1.0531555 2.121852
2: a 0.3631284 -1.388861
3: c 4.0566838 -2.367558

Why doesn't lazy evaluation work in this R function? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to write an R function that evaluates an expression within a data-frame
I want to write a function that sorts a data.frame -- instead of using the cumbersome order(). Given something like
> x=data.frame(a=c(5,6,7),b=c(3,5,1))
> x
a b
1 5 3
2 6 5
3 7 1
I want to say something like:
sort.df(x,b)
So here's my function:
sort.df <- function(df, ...) {
with(df, df[order(...),])
}
I was really proud of this. Given R's lazy evaluation, I figured that the ... parameter would only be evaluated when needed -- and by that time it would be in scope, due to 'with'.
If I run the 'with' line directly, it works. But the function doesn't.
> with(x,x[order(b),])
a b
3 7 1
1 5 3
2 6 5
> sort.df(x,b)
Error in order(...) : object 'b' not found
What's wrong and how to fix it? I see this sort of "magic" frequently in packages like plyr, for example. What's the trick?
This will do what you want:
sort.df <- function(df, ...) {
dots <- as.list(substitute(list(...)))[-1]
ord <- with(df, do.call(order, dots))
df[ord,]
}
## Try it out
x <- data.frame(a=1:10, b=rep(1:2, length=10), c=rep(1:3, length=10))
sort.df(x, b, c)
And so will this:
sort.df2 <- function(df, ...) {
cl <- substitute(list(...))
cl[[1]] <- as.symbol("order")
df[eval(cl, envir=df),]
}
sort.df2(x, b, c)
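Why the original version fails can be made visible with a small sketch: the arguments in ... are evaluated in the caller's environment, so with() never gets a chance to supply the column. The global variable b below is a hypothetical addition that makes the difference observable instead of raising an error; naive_order and captured_order are made-up names.

```r
x <- data.frame(a = c(5, 6, 7), b = c(3, 5, 1))
b <- c(2, 1, 3)  # hypothetical global that shares the column's name

# Naive version: `...` is forced in the caller's environment, so it sees
# the global b rather than the column x$b.
naive_order <- function(df, ...) with(df, order(...))

# Capturing version: substitute() keeps `b` unevaluated until eval() runs
# the call with df as the lookup scope.
captured_order <- function(df, ...) {
  eval(substitute(order(...)), envir = df, enclos = parent.frame())
}

naive <- naive_order(x, b)        # 2 1 3, the ordering of the global b
captured <- captured_order(x, b)  # 3 1 2, the ordering of column b
```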
It's because when you're passing b you're actually not passing an object. Put a browser inside your function and you'll see what I mean. I stole this from some Internet robot somewhere:
x=data.frame(a=c(5,6,7),b=c(3,5,1))
sort.df <- function(df, ..., drop = TRUE){
ord <- eval(substitute(order(...)), envir = df, enclos = parent.frame())
return(df[ord, , drop = drop])
}
sort.df(x, b)
will work.
So will this, if you're looking for a nice way to do this in an applied sense:
library(taRifx)
sort(x, f=~b)

Object not found error with ddply inside a function

This has really challenged my ability to debug R code.
I want to use ddply() to apply the same functions to different columns that are sequentially named, e.g. a, b, c. To do this I intend to repeatedly pass the column name as a string and use eval(parse(text=ColName)) to allow the function to reference it. I grabbed this technique from another answer.
And this works well, until I put ddply() inside another function. Here is the sample code:
# Required packages:
library(plyr)
myFunction <- function(x, y){
NewColName = "a"
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")
#This works.
ColName = "a"
ddply(df, sv, summarize,
Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
#This doesn't work
#Produces error: "Error in parse(text = NewColName) : object 'NewColName' not found"
myFunction(df,sv)
#Output in both cases should be
# b Ave
#1 0 1.5
#2 1 3.5
Any ideas? NewColName is even defined inside the function!
I thought the answer to this question, loops-to-create-new-variables-in-ddply, might help me but I've done enough head banging for today and it's time to raise my hand and ask for help.
Today's solution to this question is to make summarize into here(summarize). e.g.
myFunction <- function(x, y){
NewColName = "a"
z = ddply(x, y, here(summarize),
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
here(f), added to plyr in Dec 2012, captures the current context.
You can do this with a combination of do.call and call to construct the call in an environment where NewColName is still visible:
myFunction <- function(x,y){
NewColName <- "a"
z <- do.call("ddply",list(x, y, summarize, Ave = call("mean",as.symbol(NewColName),na.rm=TRUE)))
return(z)
}
myFunction(df,sv)
b Ave
1 0 1.5
2 1 3.5
I occasionally run into problems like this when combining ddply with summarize or transform or something and, not being smart enough to divine the ins and outs of navigating various environments I tend to side-step the issue by simply not using summarize and instead using my own anonymous function:
myFunction <- function(x, y){
NewColName <- "a"
z <- ddply(x, y, .fun = function(xx,col){
c(Ave = mean(xx[,col],na.rm=TRUE))},
NewColName)
return(z)
}
myFunction(df,sv)
Obviously, there is a cost to doing this stuff 'manually', but it often avoids the headache of dealing with the evaluation issues that come from combining ddply and summarize. That's not to say, of course, that Hadley won't show up with a solution...
The problem lies in the code of the plyr package itself. In the summarize function, there is a line eval(substitute(...),.data,parent.frame()). It is well known that parent.frame() can do pretty funky and unexpected stuff.
The solution of @James is a very nice workaround, but if I remember right @Hadley himself said before that the plyr package was not intended to be used within functions.
Sorry, I was wrong here. It is known though that for the moment, the plyr package gives problems in these situations.
Hence, I give you a base solution for the problem:
myFunction <- function(x, y){
NewColName = "a"
z = aggregate(x[NewColName],x[y],mean,na.rm=TRUE)
return(z)
}
> myFunction(df,sv)
b a
1 0 1.5
2 1 3.5
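The parent.frame() behaviour mentioned in this answer can be imitated in base R. The sketch below is not plyr's code: summarise_like and apply_like are hypothetical stand-ins for summarize and ddply, written only to show why a variable local to the calling function is not found, and why building the call up front (as in the do.call answer) avoids the lookup.

```r
# Stand-in for summarize: evaluates its arguments against the data, falling
# back to the frame it was called from.
summarise_like <- function(.data, ...) {
  as.data.frame(eval(substitute(list(...)), .data, parent.frame()))
}

# Stand-in for ddply: it, not the user's function, is the one calling .fun.
apply_like <- function(.data, .fun, ...) .fun(.data, ...)

broken_version <- function(x) {
  col_name <- "a"
  # col_name is searched in the data, then in apply_like's frame, then in
  # the global environment; this function's own frame is never consulted.
  apply_like(x, summarise_like, Ave = mean(eval(parse(text = col_name))))
}

fixed_version <- function(x) {
  col_name <- "a"
  # The call mean(a) is built here, where col_name is visible.
  do.call(apply_like,
          list(x, summarise_like, Ave = call("mean", as.symbol(col_name))))
}

toy <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1))
broken <- tryCatch(broken_version(toy), error = function(e) conditionMessage(e))
fixed <- fixed_version(toy)
```

In this sketch broken ends up holding an error message about col_name, while fixed is a one-row data frame with the mean of column a (no grouping is done here, unlike ddply).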
Looks like you have an environment problem. Global assignment fixes the problem, but at the cost of one's soul:
library(plyr)
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
d.f = data.frame(a,b,c)
sv = c("b")
ColName = "a"
ddply(d.f, sv, summarize,
Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)
myFunction <- function(x, y){
NewColName <<- "a"
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
myFunction(x=d.f,y=sv)
eval is looking in parent.frame(1). So if you instead define NewColName outside myFunction it should work:
rm(NewColName)
NewColName <- "a"
myFunction <- function(x, y){
z = ddply(x, y, summarize,
Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
)
return(z)
}
myFunction(x=d.f,y=sv)
By using get to pull out my.parse from the earlier environment, we can come much closer, but still have to pass curenv as a global:
myFunction <- function(x, y){
NewColName <- "a"
my.parse <- parse(text=NewColName)
print(my.parse)
curenv <<- environment()
print(curenv)
z = ddply(x, y, summarize,
Ave = mean( eval( get("my.parse" , envir=curenv ) ), na.rm=TRUE)
)
return(z)
}
> myFunction(x=d.f,y=sv)
expression(a)
<environment: 0x0275a9b4>
b Ave
1 0 1.5
2 1 3.5
I suspect that ddply is evaluating in the .GlobalEnv already, which is why all of the parent.frame() and sys.frame() strategies I tried failed.

Creation of an object inside names<-() gives an error. How to explain?

This
x <- list(12, 13)
names(y <- x) <- c("a", "b")
gives the error:
Error in names(y <- x) <- c("a", "b") : object 'y' not found
Can anyone explain why?
According to R's rules of evaluation y <- x should be evaluated inside the parent frame of names<-. So y should be created in global environment.
Thanks.
[update] If object y is already present in the global environment, then the error is:
Error in names(y <- x) <- c("a", "b") : could not find function "<-<-"
[update2] Here it is, another construct, which I encountered today.
(X <- matrix(0, nrow = 10, ncol = 10))[1:3] <- 3:5
Error during wrapup: object 'X' not found
This is related to the way that <- recursively transforms the LHS, appending "<-" to the names of functions to get the replacement form. The first argument is treated specially. Note the difference between the last two:
x <- a <- 1
`f<-` <- function(x, a, value) x
f(x, a <- 2) <- 2
f(x <- 2, a) <- 2
# Error in f(x <- 2, a) <- 2 : could not find function "<-<-"
For what you're trying to do, I'd use setNames anyway.
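A short sketch of the alternatives, using the list from the question. It shows setNames, the usual two-step form, and a direct call to the replacement function `names<-`, which is roughly what names(y) <- value is rewritten into. The variable names y2 and y3 are made up for the comparison.

```r
x <- list(12, 13)

# One step, no replacement-function syntax involved.
y <- setNames(x, c("a", "b"))

# Two steps: create the object first, then apply the replacement form.
y2 <- x
names(y2) <- c("a", "b")

# Roughly what the line above is rewritten into: the target must already
# exist as a variable, because it is both read and reassigned.
y3 <- x
y3 <- `names<-`(y3, value = c("a", "b"))
```

All three objects end up identical, which is why the target of a replacement call has to be a plain variable name rather than an expression such as y <- x.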
This is probably due to lazy evaluation. There is little guarantee what order things will be done in when doing multiple tasks in one line. Apparently in this case it tries to find y before evaluating the assignment. If you just ask for the names, then y is assigned.
It is best to do these types of things in 2 steps so you can be assured that the first is done before the second needs the results.
