multidplyr: assign functions to cluster - R

(see working solution below)
I want to use multidplyr to parallelize a function :
calculs.R
f <- function(x){
return(x+1)
}
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
d <- data.frame(a = 1:1000, b = sample(1:2, 1000, replace = TRUE))
result <- d %>%
partition(b) %>%
do(f(.)) %>%
collect()
I then get:
Initialising 3 core cluster.
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: could not find function "f"
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
How can I assign sourced functions to each core?
==================
Here is the working script:
You must extract the value to update and turn the result into a data frame.
calculs.R
f <- function(x){
return(data.frame(x$a+1))
}
You must set up the cluster and copy the sourced functions to it.
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
cl <- create_cluster(3)
set_default_cluster(cl)
cluster_copy(cl, f)
d <- data.frame(a=1:10,b=c(rep(1,5),rep(2,5)))
result <- d %>%
partition(b) %>%
do(f(.)) %>%
collect()

It looks like you initialized a cluster (though you don't show this part). You need to export variables/functions from your global environment to each worker. Assuming you made your cluster as
cl <- create_cluster(3)
set_default_cluster(cl)
Can you try
cluster_copy(cl, f)
This will copy-and-export f to each worker (I think...)
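If you want to double-check that the copy worked, you can ask each worker whether it can see f. This is just a sketch using base parallel's clusterCall, on the assumption that multidplyr's cluster is an ordinary PSOCK cluster underneath (which it was in the version this question uses):
# Hypothetical sanity check; assumes cl behaves like a parallel PSOCK cluster
parallel::clusterCall(cl, function() exists("f"))
# Expect one TRUE per worker after cluster_copy(cl, f)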
Extra
You'll likely run into another problem: your function accepts x as an argument and adds 1 to it
f <- function(x){
return(x+1)
}
Since you're passing a data frame to f, you are asking for data.frame + 1, which isn't what you want here. You might want to change your function to something like
f <- function(x){
return(x$a+1)
}

Related

Running code with reprex() and from console produces different result

I am trying to understand why my code produces a different result when run with reprex::reprex() than directly from the script and how to consistently produce the output of the reprex() call. The issue emerges within the filter() call.
Example 1 shows that my function filters the data.frame rows based on a column's matches with another vector when I select and copy the code, then run it with reprex::reprex() in RStudio.
Example 2 (a screenshot of the console output) shows that running the exact same code directly from the script throws a 'match' requires vector arguments error.
Example 3 shows, with a slight modification of the function, that !!sym() appears to be creating some sort of time series object. Omitting sym() and replacing == with %in% has the same consequence.
UPDATE:
The issue did not replicate on others' machines nor on my own. I swapped out of an RStudio project to a single .R file and it still persisted. However, when I pressed Ctrl+Shift+F10 to detach libraries, data, etc., the discrepancy vanished. This suggested that I was dealing with some sort of namespace issue. Upon returning to the RStudio project, the issue returned. However, calling dplyr::filter() within the function resolved it, reinforcing that this was a namespace issue.
While the accepted answer provides some solutions and correctly identifies the issue, the outstanding question (for another post) is why the namespace precedence was not applied in this case when I loaded the package immediately beforehand.
Example 1: !!sym() produces a vector for %in% as expected when code is run with reprex::reprex()
# Packages
library(dplyr)
library(rlang)
# Example data
mydat <- data.frame(type = c("a","b","c","a","c"))
myvec <- c("a","c")
# Example function
foo <- function(df, type_var = "type", vec){
df %>%
filter(!!sym(type_var) %in% vec)
}
# Call function
foo(df = mydat, type_var = "type", vec = myvec)
#> type
#> 1 a
#> 2 c
#> 3 a
#> 4 c
Example 2: Console output shows type error when run from within an R script
Example 3: slightly modified function shows that !!sym() is creating a time series object?!
# Example function
foo <- function(df, type_var = "type", vec){
df %>%
filter(!!sym(type_var) == "a")
}
# Apply function
foo(df = mydat, type_var = "type", vec = myvec)
#>Time Series:
#>Start = 1
#>End = 5
#>Frequency = 1
#> [,1]
#> [1,] 0
#> [2,] 0
#> [3,] 0
#> [4,] 0
#> [5,] 0
It's related to which version of filter is being used, i.e. whether it comes from stats or dplyr. I suspect you have an ~/.Rprofile somewhere that loads some packages, and that it is being read sometimes and not others.
Changing example 3 to
foo <- function(df, type_var = "type", vec){
df %>%
dplyr::filter(!!sym(type_var) == "a")
}
# Apply function
foo(df = mydat, type_var = "type", vec = myvec)
yields:
type
1 a
2 a
Similarly changing example 1 to:
library(dplyr)
library(rlang)
# Example data
mydat <- data.frame(type = c("a","b","c","a","c"))
myvec <- c("a","c")
# Example function
foo <- function(df, type_var = "type", vec){
df %>%
dplyr::filter(!!sym(type_var) %in% vec)
}
# Call function
foo(df = mydat, type_var = "type", vec = myvec)
gives:
type
1 a
2 c
3 a
4 c
Beware of namespace collisions when running R in the console, Rscript, etc.; they can cause bugs that are hard to track down. filter and lag are the chief culprits (source: I almost had to retract a journal paper because lag was imported from the wrong namespace in an Rscript and failed in a weird, silent way).
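If you suspect a collision like this, base R's find() shows every attached package that defines a given name, and the conflicted package can turn ambiguous calls into explicit errors. A short sketch (conflicted is an extra suggestion on top of the answer above):
library(dplyr)
find("filter")
#> [1] "package:dplyr" "package:stats"
# Optionally, make ambiguous calls an error until a preference is declared:
library(conflicted)
conflict_prefer("filter", "dplyr") # always resolve filter to dplyr::filter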

Access result later in pipe

I am trying to create functions which print the number of rows excluded in a dataset at each step in a pipe.
Something like this:
iris %>%
function_which_save_nrows_and_return_the_data() %>%
filter(exclude some rows) %>%
function_which_prints_difference_in_rows_before_after_exclusion_and_returns_data %>%
function_which_save_nrows_and_return_the_data() %>%
function_which_prints_difference_in_rows_before_after_exclusion_and_returns_data ...etc
These are the functions I have attempted:
n_before = function(x) {assign("rows", nrow(x), .GlobalEnv); return(x)}
n_excluded = function(x) {
print(rows - nrow(x))
return(x)
}
This successfully saves the object rows. But if I add two more links, the object is NOT saved.
So how can I create and access the rows object later in the pipe?
This is due to R's lazy evaluation, and it occurs even if pipes are not used. See the code below. There, the argument to n_excluded is filter(n_before(iris), Species != 'setosa'); at the point where rows is used in the print statement, the argument x has not yet been referenced from within n_excluded, so it has not been evaluated and rows does not yet exist.
if (exists("rows")) rm(rows) # ensure rows does not exist
n_excluded(filter(n_before(iris), Species != 'setosa'))
## Error in h(simpleError(msg, call)) :
## error in evaluating the argument 'x' in selecting a method for function
## 'print': object 'rows' not found
To fix this
1) we can force x before the print statement.
n_excluded = function(x) {
force(x)
print(rows - nrow(x))
return(x)
}
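With force(x) in place, the original pipeline evaluates its stages in the expected order. A quick check, assuming dplyr is loaded and n_before is defined as above:
iris %>%
n_before() %>%
filter(Species != 'setosa') %>%
n_excluded()
# prints 50 (the number of excluded setosa rows) and returns the data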
2) Alternatively, we can use magrittr's eager pipe, which guarantees that the stages of a pipeline run in order. magrittr exports the function but does not bind it to an operator, so we assign it to one ourselves:
`%s>%` <- magrittr::pipe_eager_lexical
iris %>%
n_before() %>%
filter(Species != 'setosa') %s>% # note use of %s>% on this line
n_excluded()
The magrittr developer has stated that he will add it as an operator if there is sufficient demand, so you might want to add such a request to magrittr issue #247 on GitHub.
You can also use the extended capabilities of pipeR, whose %>>% operator supports side-effect assignment with the (~ name = expression) syntax:
library(dplyr)
library(pipeR)
n_excluded = function(x) {
print(rows - nrow(x))
return(x)
}
p <- iris %>>%
(~rows=nrow(.)) %>>%
filter(Species != "setosa") %>>%
n_excluded()

How to evaluate empty quosure programmatically?

In my dataset, I have a few possible grouping variables a, b, c. How do I programmatically tell dplyr to not group by any variables?
For example:
granularity <- NA
if(isTRUE(granularity == 'all')){
# all group variables
group_variables <- quos(a, b, c)
}else if(isTRUE(granularity == 'no_c')){
# all except c
group_variables <- quos(a, b)
}else{
# no group variables
group_variables <- quo()
}
data_summary <- mydata %>%
group_by(!!! group_variables) %>%
summarise(
x_mean = mean(x)
)
This will run correctly if I set granularity to 'all' or 'no_c', but it fails when I assign group_variables to the empty quosure. Does anyone know how to make this work?
Edit: This question also applies to functions like select, so assume I wanted to run
data_select <- mydata %>%
select(!!! select_variables, d, e, f)
How do I set select_variables to sometimes be quos(a, b, c) or sometimes be empty?
Thanks!
Use group_variables <- NULL in that clause:
}else{
# no group variables
group_variables <- NULL
}
Also note the massive warning (alongside the error) that the empty-quosure version produces under newer rlang:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `<empty>` is unknown
In addition: Warning message:
Unquoting language objects with `!!!` is soft-deprecated as of rlang 0.3.0.
Please use `!!` instead.
# Bad:
dplyr::select(data, !!!enquo(x))
# Good:
dplyr::select(data, !!enquo(x)) # Unquote single quosure
dplyr::select(data, !!!enquos(x)) # Splice list of quosures
You might want to consider not using packages with unstable APIs.
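For completeness, here is a minimal end-to-end sketch of the NULL approach; mydata is made up here, with columns a, b, c, and x standing in for the question's data:
library(dplyr)
mydata <- data.frame(a = rep(1:2, 4), b = rep(1:2, each = 4), c = 1, x = rnorm(8))
group_variables <- NULL # splicing NULL with !!! inserts no arguments
mydata %>%
group_by(!!!group_variables) %>%
summarise(x_mean = mean(x))
# one overall mean, since nothing was grouped
# the same trick works for select(): select(mydata, !!!group_variables, a, b)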

How to call the output of a function in another function?

I have two functions:
getTotalBL <- function(Ne, n){
...
total_branch_length #output
}
getSNPnumber <- function(total_branch_length,mu,L){
}
where total_branch_length in getSNPnumber is the output of the first function (getTotalBL).
Do I need to do anything more than use the same name as the output, or is it correct this way?
You need to store the output of getTotalBL in an object and pass that on as a function argument to getSNPnumber. The scope of total_branch_length is restricted to getTotalBL.
Here are two examples to demonstrate:
Possibility 1:
f1 <- function(x) x^2
f2 <- function(xsquared, b) xsquared + b
f2(f1(2), 1)
#[1] 5
which is the same as
ret_from_f1 <- f1(2)
f2(ret_from_f1, 1)
#[1] 5
Possibility 2:
We can also have a function as an argument of another function (here f2):
f2 <- function(fct, x, b) fct(x) + b
f2(f1, 2, 1)
#[1] 5
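In terms of the question's own functions, the same pattern looks like this; the bodies below are made-up placeholders, since the real ones are elided:
getTotalBL <- function(Ne, n){
total_branch_length <- Ne * n # placeholder for the real computation
total_branch_length
}
getSNPnumber <- function(total_branch_length, mu, L){
total_branch_length * mu * L # placeholder
}
tbl <- getTotalBL(Ne = 100, n = 10) # store the output...
getSNPnumber(tbl, mu = 1e-8, L = 1e6) # ...and pass it on as an argument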
If all you're interested in is transferring the result of one function into another, I'd like to suggest the %>% (pipe) operator; it lets you pipe/chain results from one command into another.
It's available in the magrittr package (or dplyr, if you're already using the tidyverse).
Reusing the above 'Possibility 1'
f1 <- function(x) x^2
f2 <- function(xsquared, b) xsquared + b
require(dplyr)
f1(2) %>% f2(1)
UPDATE: Why %>% is useful
To my extremely limited knowledge, R stores all objects in RAM. When you create intermediate objects only to remove them later, they still take up memory until they are removed. Using %>% lets you avoid creating those named intermediates in the first place.
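Concretely, compare the two styles, reusing f1 and f2 from above:
# with an intermediate object, tmp sits in the workspace until you remove it
tmp <- f1(2)
f2(tmp, 1)
rm(tmp)
# with the pipe, the same computation needs no intermediate name at all
f1(2) %>% f2(1)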

Create a list of functions from a vector of characters

Thanks in advance, and sorry if this question has been answered previously; I have looked pretty extensively. I have a dataset containing a row with concatenated information, specifically: name, color code, some function expression. For example, one value may be:
cost#FF0033#log(x)+6.
I have all of the code to extract the information, and I end up with a vector of expressions that I would like to convert to a list of actual functions.
For example:
func.list <- list()
test.func <- c("x","x+1","x+2","x+3","x+4")
where test.func is the vector of expressions. What I would like is:
func.list[[3]]
To be equivalent to
function(x){x+3}
I know that I can create a function using:
somefunc <- function(x){eval(parse(text="x+1"))}
to convert a character value into a function. The problem comes when I try to loop through to make multiple functions. Here is an example of something I tried that didn't work:
for(i in 1:length(test.func)){
temp <- test.func[i]
f <- assign(function(x){eval(expr=parse(text=temp))})
func.list[[i]] <- f
}
Based on another post (http://stats.stackexchange.com/questions/3836/how-to-create-a-vector-of-functions) I also tried this:
makefunc <- function(y){y;function(x){y}}
for(i in 1:length(test.func)){
func.list[[i]] <- assign(x=paste("f",i,sep=""),value=makefunc(eval(parse(text=test.func[i]))))
}
Which gives the following error: Error in eval(expr, envir, enclos) : object 'x' not found
The eventual goal is to take the list of functions and apply the jth function to the jth column of the data.frame, so that the user of the script can specify how to normalize each column within the concatenated information given by the column header.
Maybe initialize your list with a single generic function, and then update each one using:
foo <- function(x){x+3}
body(foo) <- quote(x+4)
foo
#> function (x)
#> x + 4
More specifically, starting from a character, you'd probably do something like:
body(foo) <- parse(text = "x+5")
Just to add onto joran's answer, this is what finally worked:
test.data <- matrix(data=rep(1,25),5,5)
test.data <- data.frame(test.data)
test.func <- c("x","x+1","x+2","x+3","x+4")
func.list <- list()
for(i in 1:length(test.func)){
func.list[[i]] <- function(x){}
body(func.list[[i]]) <- parse(text=test.func[i])
}
processed <- mapply(do.call,func.list,lapply(test.data,list))
Thanks again, joran.
This is what I do:
f <- list(identity="x",plus1 = "x+1", square= "x^2")
funCreator <- function(snippet){
txt <- snippet
function(x){
exprs <- parse(text = txt)
eval(exprs)
}
}
listOfFunctions <- lapply(setNames(f,names(f)),function(x){funCreator(x)}) # I like to have some control of the names of the functions
listOfFunctions[[1]] # try to see what the actual function looks like?
library(pryr)
unenclose(listOfFunctions[[3]]) # good way to see the actual function http://adv-r.had.co.nz/Functional-programming.html
# Call your funcions
listOfFunctions[[2]](3) # 3+1 = 4
do.call(listOfFunctions[[3]],list(3)) # 3^2 = 9
attach(listOfFunctions) # you can also attach your list of functions and call them by name
square(3) # 3^2 = 9
identity(7) # 7 ## masked object identity, better detach it now!
detach(listOfFunctions)
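To connect this back to the stated end goal of applying the jth function to the jth column, the named list above also works with Map; the data frame here is made up, with one column per function:
dat <- data.frame(u = 1:3, v = 4:6, w = 7:9)
normalized <- Map(function(fn, col) fn(col), listOfFunctions, dat)
as.data.frame(normalized)
# u gets identity, v gets x+1, w gets x^2, matched by position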
