How to pass variables to functions called in spark_apply()? - r

I would like to be able to pass extra variables to functions that are called by spark_apply in sparklyr.
For example:
# setup
library(sparklyr)
sc <- spark_connect(master='local', packages=TRUE)
iris2 <- iris[,1:(ncol(iris) - 1)]
df1 <- sdf_copy_to(sc, iris2, repartition=5, overwrite=T)
# This works fine
res <- spark_apply(df1, function(x) kmeans(x, 3)$centers)
# This does not
k <- 3
res <- spark_apply(df1, function(x) kmeans(x, k)$centers)
As an ugly workaround, I can do what I want by saving the values into an R package and then referencing them, i.e.:
> myPackage::k_equals_three == 3
[1] TRUE
# This also works
res <- spark_apply(df1, function(x) kmeans(x, myPackage::k_equals_three)$centers)
Is there a better way to do this?

I don't have spark set up to test, but can you just create a closure?
kmeanswithk <- function(k) {force(k); function(x) kmeans(x, k)$centers}
k <- 3
res <- spark_apply(df1, kmeanswithk(k))
Basically just create a function to return a function then use that.
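To see what the closure captures without needing Spark, here is a minimal local sketch. The name make_centers_fn is made up for illustration, and whether spark_apply() ships the captured value to the workers is an assumption that is not tested here.

```r
# Function factory: returns a one-argument function with k fixed inside it.
make_centers_fn <- function(k) {
  force(k)  # evaluate k now, so the closure stores the value
  function(x) kmeans(x, centers = k)$centers
}

k <- 3
centers_fn <- make_centers_fn(k)
k <- 10  # changing k afterwards does not affect the closure

captured_k <- environment(centers_fn)$k  # 3

# Local check on a plain data frame
set.seed(1)
local_centers <- centers_fn(iris[, 1:4])
dim(local_centers)  # 3 4
```

The value lives in the environment of the returned function, so the function no longer depends on a global k being visible wherever it is evaluated.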

spark_apply() now has a context argument for you to pass additional objects/variables/etc to the environment.
res <- spark_apply(df1, function(x, k) {
  kmeans(x, k)$cluster
}, context = {k <- 3})
or
k <- 3
res <- spark_apply(df1, function(x, k) {
  kmeans(x, k)$cluster
}, context = {k})
The R documentation does not include any examples with the context argument, but you might learn more from reading the PR: https://github.com/rstudio/sparklyr/pull/1107.
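If you need more than one value, a list passed as context should work the same way, since the object arrives as the second argument of the function. This is a sketch, not tested against a Spark cluster: centers_with_context and the nstart setting are made up for illustration, and the function is only checked locally on a plain data frame.

```r
# Hypothetical helper: `context` arrives as the second argument.
centers_with_context <- function(x, context) {
  kmeans(x, centers = context$k, nstart = context$nstart)$centers
}

# Local check on a plain data frame (no Spark needed)
set.seed(42)
local_res <- centers_with_context(iris[, 1:4], list(k = 3, nstart = 5))
dim(local_res)  # 3 4

# With a Spark connection this would be called as (not run here):
# res <- spark_apply(df1, centers_with_context,
#                    context = list(k = 3, nstart = 5))
```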

Related

R Call changing depending on whether variable is named or not in S3 Methods

I'm looking to deal with call evaluations but am out of my depth when it comes to S3 methods. Basically, I am wondering why a variable that I pass to a function call is not evaluated but instead remains the name of the variable rather than its value. And all of this depends on whether I name the argument in the function call or not.
Let me illustrate with a short example:
I first create a quick function to create a sample class to be used with S3 Methods:
create_myS3 <- function(a, b){
out <- list()
out$a <- a
out$b <- b
class(out) <- "myS3"
return(out)
}
Now the set-up that I am interested in features a number of functions within each other. I first create an S3 method for this myS3 class, let's call it m and we define a specific routine for the myS3 class as well as a default method. Note that the myS3 version calls the default version.
m <- function(x, ...){UseMethod("m")}
m.myS3 <- function(x, estimator = NULL){
y <- list()
y$a <- x$a + 1
y$b <- x$b + 1
out <- m.default(y,
estimator)
return(out)
}
m.default <- function(x, estimator = NULL, ...){
out <- list()
out$call <- sys.call()
out$result <- x$a - x$b
out$aux$estimator <- estimator
return(out)
}
Now that we have defined the functions, we can look at the results function that I'm interested in:
h <- function(x){
out <- list()
out$result_call <- if(is.null(x$call$estimator)){"Success"}else{"Fail"}
out$result_list <- if(is.null(x$aux$estimator)){"Success"}else{"Fail"}
return(out)
}
Its entire purpose is to check whether the estimator element is present in the object it is passed and to give a message based on that.
Ok, now let's put it all together:
g <- function(x){
object <- m(x)
out <- h(object)
return(out)
}
initial <- create_myS3(10,5)
g(initial)
The g() function now calls m() on the input, which was created with the create_myS3 function - so is of class myS3 and is therefore passed to m.myS3 before it is passed to m.default. The resulting object is then passed to h() - in all cases we have not set the estimator argument, which then defaults to NULL and both my check statements in h() return Success.
Now all I do is change one tiny thing: I now modify m.myS3 to call the m.default not just with the order of the input variables but now I also specify the option - in my mind the more robust way. So to clarify, from this m.default(y, estimator) I change it to m.default(x = y, estimator = estimator).
This change then changes my results from h() to Fail for the evaluation if(is.null(x$call$estimator)){"Success"}else{"Fail"} while if(is.null(x$aux$estimator)){"Success"}else{"Fail"} results in Success.
The reason for this is that the call statement evaluates to estimator rather than to its true value NULL.
Is there an easy way to evaluate this call to its true value (I have tried eval and deparse)? Or, even better, is there a way to ensure that in m.myS3 the value is always passed rather than the variable?
Here below is the total code for convenience:
create_myS3 <- function(a, b){
out <- list()
out$a <- a
out$b <- b
class(out) <- "myS3"
return(out)
}
m <- function(x, ...){UseMethod("m")}
m.myS3 <- function(x, estimator = NULL){
y <- list()
y$a <- x$a + 1
y$b <- x$b + 1
out <- m.default(y,
estimator)
return(out)
}
m.default <- function(x, estimator = NULL, ...){
out <- list()
out$call <- sys.call()
out$result <- x$a - x$b
out$aux$estimator <- estimator
return(out)
}
h <- function(x){
out <- list()
out$result_call <- if(is.null(x$call$estimator)){"Success"}else{"Fail"}
out$result_list <- if(is.null(x$aux$estimator)){"Success"}else{"Fail"}
return(out)
}
g <- function(x){
object <- m(x)
out <- h(object)
return(out)
}
initial <- create_myS3(10,5)
g(initial)
$result_call
[1] "Success"
$result_list
[1] "Success"
## Changing m.myS3 (only change is to name the option of function m.default)
m.myS3 <- function(x, estimator = NULL){
y <- list()
y$a <- x$a + 1
y$b <- x$b + 1
out <- m.default(x = y,
estimator = estimator)
return(out)
}
g(initial)
$result_call
[1] "Fail"
$result_list
[1] "Success"
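A minimal sketch of what seems to be going on here, using made-up helpers (show_call, by_position and by_name are not part of the code above). sys.call() records the call as it was written, so a named argument is stored as the unevaluated symbol estimator, while an unnamed argument has no name for $ to find. Evaluating the recorded argument in the caller's frame is one possible way to get the value back.

```r
# Hypothetical helper: records the call and the evaluated estimator.
show_call <- function(x, estimator = NULL) {
  caller <- parent.frame()  # frame the call was written in
  cl <- sys.call()          # the call as written, arguments unevaluated
  list(call = cl,
       estimator_value = eval(cl$estimator, caller))
}

by_position <- function(estimator = NULL) show_call(1, estimator)
by_name     <- function(estimator = NULL) show_call(x = 1, estimator = estimator)

p <- by_position()
n <- by_name()

is.null(p$call$estimator)  # TRUE: no argument named `estimator` in the call
is.name(n$call$estimator)  # TRUE: the symbol `estimator`, not its value
is.null(n$estimator_value) # TRUE: the symbol evaluated in the caller's frame
```

Checking the evaluated value, as x$aux$estimator already does in h(), avoids depending on how the call happened to be written.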

how to append an element to a list without keeping track of the index?

I am looking for the r equivalent of this simple code in python
mylist = []
for this in that:
    df = 1
    mylist.append(df)
basically just creating an empty list, and then adding the objects created within the loop to it.
I only saw R solutions where one has to specify the index of the new element (say mylist[[i]] <- df), thus requiring to create an index i in the loop.
Is there any simpler way than that to just append after the last element?
There is a function called append:
ans <- list()
for (i in 1992:1994){
n <- 1 #whatever the function is
ans <- append(ans, n)
}
ans
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
Note: Using apply functions instead of a for loop is better (not necessarily faster) but it depends on the actual purpose of your loop.
Answering OP's comment: About using ggplot2 and saving plots to a list, something like this would be more efficient:
plotlist <- lapply(seq(2,4), function(i) {
require(ggplot2)
dat <- mtcars[mtcars$cyl == 2 * i,]
ggplot() + geom_point(data = dat ,aes(x=cyl,y=mpg))
})
Thanks to @Wen for sharing Comparison of c() and append() functions:
Concatenation (c) is pretty fast, but append is even faster and therefore preferable when concatenating just two vectors.
There is: mylist <- c(mylist, df) but that's usually not the recommended way in R. Depending on what you're trying to achieve, lapply() is often a better option.
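For illustration, here is a sketch of the lapply() version next to a preallocated loop. The input vector that and the toupper() call are placeholders standing in for whatever the real loop does.

```r
that <- c("a", "b", "c")  # placeholder input

# lapply() builds the list itself, so there is no index to maintain
mylist <- lapply(that, function(this) {
  toupper(this)  # placeholder for the real work
})

# Equivalent loop with a preallocated list
mylist2 <- vector("list", length(that))
for (i in seq_along(that)) {
  mylist2[[i]] <- toupper(that[[i]])
}

identical(mylist, mylist2)  # TRUE
```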
mylist <- list()
for (i in 1:100){
n <- 1
mylist[[(length(mylist) +1)]] <- n
}
This seems to me the faster solution.
x <- 1:1000
aa <- microbenchmark({xx <- list(); for(i in x) {xx <- append(xx, values = i)} })
bb <- microbenchmark({xx <- list(); for(i in x) {xx <- c(xx, i)} } )
cc <- microbenchmark({xx <- list(); for(i in x) {xx[(length(xx) + 1)] <- i} } )
sapply(list(aa, bb, cc), (function(i){ median(i[["time"]]) / 10e5 }))
#{append}=4.466634 #{c}=3.185096 #{this.one}=2.925718
mylist <- list()
for (i in 1:100) {
df <- 1
mylist <- c(mylist, df)
}
Use
first_list = list(a=0,b=1)
newlist = c(first_list,list(c=2,d=3))
print(newlist)
$a
[1] 0
$b
[1] 1
$c
[1] 2
$d
[1] 3
Here's an example:
glmnet_params = list(family="binomial", alpha = 1,
type.measure = "auc",nfolds = 3, thresh = 1e-4, maxit = 1e3)
Now:
glmnet_classifier = do.call("cv.glmnet",
c(list(x = dtm_train, y = train$target), glmnet_params))
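The same pattern with a base function, since the glmnet call needs data that is not shown here. The names base_args, extra_args and all_args are made up for this sketch.

```r
base_args  <- list(from = 1)
extra_args <- list(to = 10, by = 3)

# c() on two lists appends the elements of the second to the first
all_args <- c(base_args, extra_args)
names(all_args)  # "from" "to" "by"

# do.call() passes the list elements as named arguments
result <- do.call("seq", all_args)
result  # 1 4 7 10
```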

R: Using For Loop Variable in Function Declaration

I would like to create a list of functions in R where values from a for loop are stored in the function definition. Here is an example:
init <- function(){
  mod <- list()
  for(i in 1:3){
    mod[[length(mod) + 1]] <- function(x) sum(i + x)
  }
  return(mod)
}
mod <- init()
mod[[1]](2) # 5 - but I want 3
mod[[2]](2) # 5 - but I want 4
In the above example, regardless of which function I call, i is always the last value in the for loop sequence, I understand this is the correct behavior.
I'm looking for something that achieves this:
mod[[1]] <- function(x) sum(1 + x)
mod[[2]] <- function(x) sum(2 + x)
mod[[3]] <- function(x) sum(3 + x)
You can explicitly ensure i is evaluated at its current value in the for loop by using force.
init <- function(){
  mod <- list()
  f_gen = function(i) {
    force(i)
    return(function(x) sum(i + x))
  }
  for(i in 1:3){
    mod[[i]] <- f_gen(i)
  }
  return(mod)
}
mod <- init()
mod[[1]](2)
# [1] 3
mod[[2]](2)
# [1] 4
More details are in the Functions/Lazy Evaluation subsection of Advanced R. Also see ?force, of course. Your example is fairly similar to the examples given in ?force.
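To make the effect of force() visible, here is a small side-by-side sketch. The names make_adder_lazy and make_adder are made up for illustration. Without force(), the argument i stays an unevaluated promise until the returned function is first called, and by then the loop variable has its final value.

```r
# No force(): `i` is only looked up when the returned function first runs
make_adder_lazy <- function(i) function(x) i + x

# With force(): `i` is evaluated when the generator is called
make_adder <- function(i) {
  force(i)
  function(x) i + x
}

lazy <- list()
eager <- list()
for (i in 1:3) {
  lazy[[i]]  <- make_adder_lazy(i)
  eager[[i]] <- make_adder(i)
}

lazy[[1]](2)   # 5, sees the final value i = 3
eager[[1]](2)  # 3
eager[[2]](2)  # 4
```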
Using a single-function generator function (f_gen in my code above) seems to make more sense than a list-of-functions generator function. Using my f_gen, your code could be simplified:
f_gen = function(i) {
force(i)
return(function(x) sum(i + x))
}
mod2 <- lapply(1:3, f_gen)
mod2[[1]](2)
# [1] 3
mod2[[2]](2)
# [1] 4
## or alternately
mod3 = list()
for (i in 1:3) mod3[[i]] <- f_gen(i)
mod3[[1]](2)
mod3[[2]](2)

R: Parallel execution nested within a sequential loop with dependency

Let's say I have two functions f1 and f2. f2 is designed to take the output of f1 as an argument, and f1 is designed to take its own output to update it. Before the loop starts, output from f1 is initialized. Then within each iteration, f2 takes the previous output from f1 and executes, then f1 executes to update its own output. Two vectors will gather the sequential output from f1 and f2 respectively. The following code is a simple working example:
f1 <- function(x) return(x + pi)
f2 <- function(x) return(log(x))
f1.result <- res1 <- f1(1)
f2.result <- NULL
for(i in 2:100) { ## Need to parallelize these two lines ##
res2 <- f2(res1); f2.result <- c(f2.result, res2)
res1 <- f1(res1); f1.result <- c(f1.result, res1)
}
I am looking to parallelize the two executions inside the loop, i.e. to get them to run at the same time. How do I achieve this in R? I am familiar with the basics of foreach but can't figure this out. Thanks.
OK I think I figured this out. It's actually pretty simple. I use the doParallel package:
f1 <- function(x) return(x + pi)
f2 <- function(x) return(log(x))
f1.result <- res1 <- f1(1)
f2.result <- NULL
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
getDoParWorkers()
for(j in 2:100) {
  res <- foreach(i = 1:2, .combine = c) %dopar% {
    if(i==1) res <- f1(res1)
    else res <- f2(res1)
  }
  res1 <- res[1]; f1.result <- c(f1.result, res1)
  res2 <- res[2]; f2.result <- c(f2.result, res2)
}
stopCluster(cl)
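A side note on the dependency structure, as a sketch rather than a replacement for the parallel version: in this toy example f2 only reads f1's output, so the f1 chain can be built first and f2 applied to it afterwards. This assumes f2 is vectorized (log() is) and that f1 never needs f2's result. The names f1.chain and f2.vec are made up here.

```r
f1 <- function(x) x + pi
f2 <- function(x) log(x)
n <- 100

# Reference: the original sequential loop
f1.result <- res1 <- f1(1)
f2.result <- NULL
for (i in 2:n) {
  res2 <- f2(res1); f2.result <- c(f2.result, res2)
  res1 <- f1(res1); f1.result <- c(f1.result, res1)
}

# Step 1: build the whole f1 chain (inherently sequential)
f1.chain <- Reduce(function(acc, step) f1(acc), x = seq_len(n - 1),
                   init = f1(1), accumulate = TRUE)

# Step 2: apply f2 to every f1 value except the last one
f2.vec <- f2(f1.chain[-n])

all.equal(f1.chain, f1.result)  # TRUE
all.equal(f2.vec, f2.result)    # TRUE
```

If f2 were expensive and not vectorized, step 2 is also the part that could be handed to foreach in one go, instead of starting a parallel job in every iteration.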

Create new functions using a list of functions and a list of function parameters to be passed

I am trying to create new functions from a list of functions and a list of parameters to be passed to these functions, but have been unable to do so. Please see the example below.
fun_list <- list(f = function(x, params) {x+params[1]},
z = function(a, params) {a * params[1] * params[2]})
params_list <- list(f = 1, z = c(3, 5))
# goal is to create 2 new functions in global environment
# fnew <- function(x) {x+1}
# znew <- function(a) {a*3*5}
# I've tried
for(x in names(fun_list)){
force(x)
assign(paste0(x, "new"), function(...) fun_list[[x]] (..., params = params_list[[x]]))
}
The goal is to do this dynamically for arbitrary functions and parameters.
Well, force() doesn't work in a for-loop because for loops do not create new environments. Based on a previous question of mine, I created a capture() function
capture <- function(...) {
vars <- sapply(substitute(...()), deparse);
pf <- parent.frame();
Map(assign, vars, mget(vars, envir=pf, inherits = TRUE), MoreArgs=list(envir=pf))
}
this allows
for(x in names(fun_list)) {
  f = local({
    capture(x);
    p = params_list[[x]];
    f = fun_list[[x]];
    function(x) f(x, p)
  })
  assign(paste0(x, "new"), f)
}
where we create a local, private environment for the functions to store their default parameter values.
Which gives
fnew(2)
# [1] 3
znew(2)
# [1] 30
How about this:
for(x in names(fun_list)) {
formals(fun_list[[x]])$params <- params_list[[x]]
assign(paste0(x, "new"), fun_list[[x]])
}
This is similar in spirit:
ps <- list(fp=1,zp=c(3,5))
f0s <- substitute(list(f=function(x)x+fp,z=function(a)a*zp1*zp2),as.list(unlist(ps)))
f0s # list(f = function(x) x + 1, z = function(a) a * 3 * 5)
fs <- eval(f0s)
fs$f(1) # 2
To do the fancy thing described in the OP, you'd probably have to mess with formals.
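Another option, sketched here with a made-up helper make_bound, is a small function factory combined with Map(). Each call to the factory gets its own environment, so force() works as intended, and no for loop is needed.

```r
fun_list <- list(f = function(x, params) {x + params[1]},
                 z = function(a, params) {a * params[1] * params[2]})
params_list <- list(f = 1, z = c(3, 5))

# Hypothetical factory: fixes `params` for one function
make_bound <- function(f, p) {
  force(f)
  force(p)
  function(...) f(..., params = p)
}

# Pair each function with its parameters, element by element
new_funs <- Map(make_bound, fun_list, params_list)
names(new_funs) <- paste0(names(new_funs), "new")

new_funs$fnew(2)  # 3
new_funs$znew(2)  # 30

# Optionally copy them into the global environment as fnew and znew
list2env(new_funs, envir = globalenv())
```

Map() matches the two lists by position, so this assumes fun_list and params_list are in the same order.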
