Using plyr ldply parallel with function within function - r

I have a data frame with multiple IDs and I am trying to perform feature extraction on the different ID sets. The data looks like this:
id x y
1 3812 60 7
2 3812 63 105
3 3812 65 1000
4 3812 69 8
5 3812 75 88
6 3812 78 13
where id takes on about 200 different values. So I am trying to extract features from the (x,y) data, and I'd like to do it in parallel, since for some datasets doing it sequentially can take about 20 minutes or so. Right now I am using dplyr like this:
x = d %>% group_by(id) %>% do(data.frame(getFeatures(., func_args)))
where func_args are just additional inputs to the function getFeatures. I am trying to use plyr::ldply with .parallel=TRUE to do this, but there is a problem: within getFeatures, I am using another function that I've written. So, when I try to run in parallel, I get an error:
Error in do.ply(i) :
task 1 failed - "could not find function "desparsify""
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
Error in do.ply(i) :
task 1 failed - "could not find function "desparsify""
where desparsify is a custom function written to process the (x,y) data (it effectively adds zeros to x locations that are not present in the dataset). I get a similar error when I try to use the cosine function from package lsa. Is there a way to use parallel processing when calling external/non-base functions in R?

You don't show how you set up plyr to parallelize, but I think I can guess what you're doing. I also guess you're on Windows. Here's a teeny standalone example illustrating what's going on:
library(plyr)
## On Windows, doParallel::registerDoParallel(2) is equivalent to:
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
## Error in do.ply(i) :
## task 1 failed - "could not find function "desparsify""
If you use doFuture instead of doParallel, the underlying future framework will make sure 'desparsify' is found, e.g.
library(plyr)
doFuture::registerDoFuture()
future::plan("multisession", workers = 2)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
str(y)
## List of 3
## $ : num 1
## $ : num 1.41
## $ : num 1.73
(disclaimer: I'm the author of the future framework)
PS. Note that plyr is a legacy package no longer maintained. You might want to look into future.apply, furrr, or foreach with doFuture as alternatives for parallelization.
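For example, here is a minimal sketch (not from the answer above) of the same grouped workflow using future.apply; the data frame d, getFeatures() and func_args are the asker's objects and are assumed to exist, and getFeatures() is assumed to return something rbind-able:
library(future.apply)
future::plan("multisession", workers = 2)
## one data frame per id, processed on the parallel workers
res_list <- future_lapply(split(d, d$id),
                          function(chunk) getFeatures(chunk, func_args))
res <- do.call(rbind, res_list)  # stack the per-id results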

There is. Take a look at the parApply family of functions; I usually use parLapply.
You'll need to create a cluster with cl <- makeCluster(number_of_cores) and pass it to parLapply, together with a vector of your ids (exactly how depends on how your functions identify the entries for each id) and your function, to produce a list with the output of your function applied to each group in parallel.
library(parallel)
cl <- makeCluster(number_of_cores)
ids <- 1:10
clusterExport(cl = cl, varlist = c("variable name", "function name"))  ## in case you need to export variables/functions
result <- parLapply(cl = cl, ids, your_function)
stopCluster(cl)
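Applied to the setup in the question, that could look roughly like this (a sketch only; d, getFeatures(), desparsify() and func_args are the asker's objects and are assumed to live in the global environment):
library(parallel)
cl <- makeCluster(4)  # adjust to your machine
## the workers need the data and the custom functions
clusterExport(cl, varlist = c("d", "getFeatures", "desparsify", "func_args"))
clusterEvalQ(cl, library(lsa))  # load any packages getFeatures depends on, e.g. lsa for cosine()
ids <- unique(d$id)
result <- parLapply(cl, ids, function(i) getFeatures(d[d$id == i, ], func_args))
stopCluster(cl)
features <- do.call(rbind, result)  # combine the per-id results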

Related

Issue with doParallel and foreach. I can register cores, but they don't seem to run

I have a large data set (about 40 million rows x 4 columns) and I want to perform a Fisher test on the data in each row.
An example of the data looks like this:
refAppleBase altAppleBase refHawBase altHawBase
1 115 1 94 0
2 117 4 93 1
3 125 4 94 1
4 107 26 89 12
5 87 53 66 38
6 68 58 64 32
I have written the following script that essentially takes each row, converts it into a matrix so it can be run through the base fisher.test() function in R, and then spits out the odds ratio and p-value.
fisher.odds.pval <- function(table){
  fisher <- fisher.test(matrix(unlist(table), nrow=2, ncol=2))
  p.val <- fisher$p.value
  odds <- unname(fisher$estimate)
  return(cbind(odds, p.val))
}
Now, obviously this is a little clunky and I want to run it across 40 million rows, so to save time I wrote the following script, using the foreach and doParallel packages to parallelize this across multiple cores.
library(doParallel)
library(foreach)
cl <- makeCluster(10)
registerDoParallel(cl)
results <- foreach(i=1:nrow(dat)) %dopar% {
fisher.odds.pval(table=dat[i,])
}
stopCluster(cl)
I have used doParallel in the past to great success. However, when running the above script, I can see the cores "wake up" and load in data, but then they immediately go to sleep, and it seems that just one core is doing all of the computing. Here is a screen grab of top while the above code is running.
[screenshot of top output]
Note: When I run the above script on a smaller dataset using %do% instead of %dopar%, it works, so I suspect there is something fishy going on in the way foreach and doParallel are communicating. But I'm really lost here right now. Any thoughts greatly appreciated.
I think I am replicating the same behavior on Windows. The function makeCluster() belongs to the parallel package and its back-end for parallelization, which is distinct from doParallel's back-end; it will work with snow and its parallel functions such as clusterApply(), etc.
If you go straight to registerDoParallel(cl = 10) or registerDoParallel(cores = 10), it will register the doParallel back-end for use with foreach() -- my system shows proper allocation on all cores this way using your function and data.
To stop the workers, use registerDoSEQ(). To show the number of workers initialized, use getDoParWorkers().
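A minimal sketch of that registration path, reusing the asker's dat and fisher.odds.pval() (assumed to exist):
library(doParallel)
library(foreach)
registerDoParallel(cores = 10)  # register the back-end directly
getDoParWorkers()               # should report 10
results <- foreach(i = 1:nrow(dat)) %dopar% {
  fisher.odds.pval(table = dat[i, ])
}
registerDoSEQ()                 # return to sequential execution when done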
The main problem here is that, if you're using Windows, R has to transfer dat to each cluster worker (which is slow and uses a lot of memory).
One possible solution would be to use shared memory.
Reproducible data
df <- read.table(text = "refAppleBase altAppleBase refHawBase altHawBase
1 115 1 94 0
2 117 4 93 1
3 125 4 94 1
4 107 26 89 12
5 87 53 66 38
6 68 58 64 32")
dat <- df[rep(1:4, 1e7), ]
fisher.odds.pval <- function(table){
  fisher <- fisher.test(matrix(unlist(table), nrow=2, ncol=2))
  p.val <- fisher$p.value
  odds <- unname(fisher$estimate)
  return(cbind(odds, p.val))
}
Your current solution (uses a lot of memory!!)
library(doParallel)
registerDoParallel(cl <- makeCluster(10))
results <- foreach(i=1:100) %dopar% {
fisher.odds.pval(table=dat[i,])
}
stopCluster(cl)
One solution using shared memory
library(doParallel)
# devtools::install_github("privefl/bigstatsr")
fbm <- bigstatsr::as_FBM(dat, type = "integer")
registerDoParallel(cl <- makeCluster(10))
results2 <- foreach(i=1:100) %dopar% {
fisher.odds.pval(table=fbm[i,])
}
stopCluster(cl)
Note that you will gain more by optimizing (e.g. vectorizing) the sequential version, instead of directly relying on parallelism.
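A related tweak, sketched here under stated assumptions rather than taken from the answer: hand each worker a block of rows from the FBM instead of one task per row, which cuts scheduling overhead while leaving the statistics unchanged (fbm, fisher.odds.pval() and the cluster size are reused from above).
library(doParallel)
library(foreach)
registerDoParallel(cl <- makeCluster(10))
## one block of row indices per worker
blocks <- split(seq_len(fbm$nrow), cut(seq_len(fbm$nrow), 10, labels = FALSE))
results3 <- foreach(idx = blocks, .combine = rbind, .packages = "bigstatsr") %dopar% {
  block <- fbm[idx, ]  # read this worker's rows from the shared-memory matrix
  do.call(rbind, lapply(seq_len(nrow(block)),
                        function(i) fisher.odds.pval(block[i, ])))
}
stopCluster(cl)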

R Pipelining with Anonymous Functions

I have a question which is an extension of another question.
I want to be able to pipeline anonymous functions. In the previous question, the answer for pipelining predefined functions was to create a pipeline operator "%|>%" and to define it this way:
"%|>%" <- function(fun1, fun2){
function(x){fun2(fun1(x))}
}
This would allow you to call a series of functions while continually passing the result of the previous function to the next. The caveat was that the functions had to be predefined. Now I'm trying to figure out how to do this with anonymous functions. The previous solution, which used predefined functions, looks like this:
square <- function(x){x^2}
add5 <- function(x){x + 5}
pipelineTest <-
square %|>%
add5
Which gives you this behaviour:
> pipelineTest(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
I would like to be able to define the pipelineTest function with anonymous functions like this:
anonymousPipelineTest <-
function(x){x^2} %|>%
function(x){x+5} %|>%
x
When I try to call this with the same arguments as above I get the following:
> anonymousPipelineTest(1:10)
function(x){fun2(fun1(x))}
<environment: 0x000000000ba1c468>
What I'm hoping to get is the same result as pipelineTest(1:10). I know that this is a trivial example. What I'm really trying to get at is a way to pipeline anonymous functions. Thanks for the help!
Using Compose, and calling the resulting function gives this:
"%|>%" <- function(...) Compose(...)()
Now get rid of the 'x' as the final "function" (replaced here with an identity function, which is not needed but is included for the example):
anonymousPipelineTest <-
function(x){x^2} %|>%
function(x){x+5} %|>% function(x){x}
anonymousPipelineTest(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
This is an application of an example offered on the ?funprog help page:
Funcall <- function(f, ...) f(...)
anonymousPipelineTest <- function(x) Reduce(Funcall,
                                            list(function(x){x+5}, function(x){x^2}),
                                            x, right=TRUE)
anonymousPipelineTest(1:10)
#[1] 6 9 14 21 30 41 54 69 86 105
I am putting up an answer which is the closest thing I've found, for other people looking for the same thing. I won't give myself points for the answer though, because it is not quite what I was after.
Returning a Function:
If you want to put several functions together, the easiest thing I've found is to use the Compose function from the functional package for R. It would look something like this:
library(functional)
anonymousPipe <- Compose(
  function(x){x^2},
  function(x){x+5})
This allows you to call this series of functions like this:
> anonymousPipe(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
Returning Data:
If all you want to do is start with some data and send it through a series of transformations (my original intent), then the first argument to Compose should be your starting data, and after the close of Compose you add a pair of parentheses to call the resulting function. It looks like this:
anonymousPipeData <- Compose(
seq(1:10),
function(x){x^2},
function(x){x+5})()
'anonymousPipeData' is now the data resulting from the series of functions. Please note the pair of parentheses at the end: this is what causes R to return the data rather than a function.
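For comparison, a small sketch that is not part of the answer above: you can also keep the data out of Compose entirely and call the composed function with the data as its argument.
library(functional)
anonymousPipe <- Compose(function(x){x^2}, function(x){x+5})
anonymousPipe(1:10)
## [1]   6   9  14  21  30  41  54  69  86 105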

Why is it not advisable to use attach() in R, and what should I use instead?

Let's assume that we have a data frame x which contains the columns job and income. Referring to the data in the frame normally requires the commands x$job for the data in the job column and x$income for the data in the income column.
However, using the command attach(x) lets you do away with the name of the data frame and the $ symbol when referring to the same data. Consequently, x$job becomes job and x$income becomes income in the R code.
The problem is that many experts in R advise NOT to use the attach() command when coding in R.
What is the main reason for that? What should be used instead?
When to use it:
I use attach() when I want the environment you get in most stats packages (eg Stata, SPSS) of working with one rectangular dataset at a time.
When not to use it:
However, it gets very messy and code quickly becomes unreadable when you have several different datasets, particularly if you are in effect using R as a crude relational database, where different rectangles of data, all relevant to the problem at hand and perhaps being used in various ways of matching data from the different rectangles, have variables with the same name.
The with() function, or the data= argument to many functions, are excellent alternatives to many instances where attach() is tempting.
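For illustration, a small sketch of those two alternatives, using the hypothetical x/job/income from the question:
with(x, mean(income))              # with(): evaluate an expression inside x
fit <- lm(income ~ job, data = x)  # data= argument: variables are looked up in x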
Another reason not to use attach: it allows access to the values of columns of a data frame for reading (access) only, and as they were when attached. It is not a shorthand for the current value of that column. Two examples:
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> attach(cars)
> # convert stopping distance to meters
> dist <- 0.3048 * dist
> # convert speed to meters per second
> speed <- 0.44707 * speed
> # compute a meaningless time
> time <- dist / speed
> # check our work
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
No changes were made to the cars data set even though dist and speed were assigned to.
If explicitly assigned back to the data set...
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> attach(cars)
> # convert stopping distance to meters
> cars$dist <- 0.3048 * dist
> # convert speed to meters per second
> cars$speed <- 0.44707 * speed
> # compute a meaningless time
> cars$time <- dist / speed
> # compute meaningless time being explicit about using values in cars
> cars$time2 <- cars$dist / cars$speed
> # check our work
> head(cars)
speed dist time time2
1 1.78828 0.6096 0.5000000 0.3408862
2 1.78828 3.0480 2.5000000 1.7044311
3 3.12949 1.2192 0.5714286 0.3895842
4 3.12949 6.7056 3.1428571 2.1427133
5 3.57656 4.8768 2.0000000 1.3635449
6 4.02363 3.0480 1.1111111 0.7575249
the dist and speed that are referenced in computing time are the original (untransformed) values; the values of cars$dist and cars$speed when cars was attached.
I think there's nothing wrong with using attach. I myself don't use it (then again, I love animals, but don't keep any, either). When I think of attach, I think long term. Sure, when I'm working with a script I know it inside and out. But in one week's time, a month or a year when I go back to the script, I find the overhead of searching where a certain variable comes from just too expensive. A lot of methods have the data argument, which makes calling variables pretty easy (sensu lm(x ~ y + z, data = mydata)). If not, I find the usage of with to my satisfaction.
In short, in my book, attach is fine for short, quick data exploration, but for developing scripts that I or others might want to use, I try to keep my code as readable (and transferable) as possible.
If you execute attach(data) multiple times, e.g. 5 times, then you can see (with the help of search()) that your data has been attached 5 times in the workspace environment. So if you detach it once (detach(data)), the data will still be present 4 times in the environment. Hence, with()/within() are better options. They create a local environment containing that object, and you can use it without creating any confusion.
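As a sketch of the within() alternative, the cars conversions from above become (everything happens in a local copy, so nothing is attached and cars itself is left untouched):
cars2 <- within(cars, {
  dist  <- 0.3048  * dist   # metres
  speed <- 0.44707 * speed  # metres per second (same factor as above)
  time  <- dist / speed     # uses the already-converted values
})
head(cars2)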

writing to global environment when running in parallel

I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
cell value x y
1 11117 NA -34 322
2 11118 NA -30 322
3 11119 NA -26 322
4 11120 NA -22 322
5 11121 NA -18 322
6 11122 NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My following solution tries to avoid that. Each cell can be calculated independently, screaming for parallel execution.
What my function actually does is check if there's a value for a specified cell number and if it's NA, it calculates it and inserts it in place of NA.
I can run my magic function (the result is the value for the corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
Now I want to run this in parallel (using snowfall), and I'm unable to read from or write to this variable from an individual core.
Question: What solution would be able to read/write from/to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach to doing this?
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see if the value has been calculated (redisGet) and if not do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm.
Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (sleeps for a second):
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that
memoize <-
    function(FUN)
{
    force(FUN)   # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x)
    {
        key <- as.character(x)
        val <- redisGet(key)
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)
        }
        val
    }
}
We then memoize our function
funmem <- memoize(fun)
and go
> system.time(res <- funmem(10)); res
user system elapsed
0.003 0.000 1.082
[1] 10
> system.time(res <- funmem(10)); res
user system elapsed
0.001 0.001 0.040
[1] 10
This does require a Redis server running outside R, but it is very easy to install; see the documentation that comes with the rredis package.
A within-R parallel version might be
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
It will depend on what the function in question is, of course, but I'm afraid snowfall won't be much of a help there. The thing is, you'll have to export the whole data frame to the different cores (see ?sfExport) and still find a way to combine it. That kind of beats the whole purpose of changing the value in the global environment, as you probably want to keep memory use as low as possible.
You can dive into the low-level functions of snow to -kind of- get this to work. See following example :
#Some data
Data <- data.frame(
cell = 1:10,
value = sample(c(100,NA),10,TRUE),
x = 1:10,
y = 1:10
)
# A sample function
sample.func <- function(){
  id <- which(is.na(Data$value))  # get the NA values
  # this splits up the values from the dataframe in a list
  # which will be passed to clusterApply later on.
  parts <- lapply(clusterSplit(cl, id), function(i) Data[i, c("x", "y")])
  # Here happens the magic
  Data$value[id] <<-
    unlist(clusterApply(cl, parts, function(x){
      x$x + x$y
    }))
}
#now we run it
require(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
sample.func()
stopCluster(cl)
> Data
cell value x y
1 1 100 1 1
2 2 100 2 2
3 3 6 3 3
4 4 8 4 4
5 5 10 5 5
6 6 12 6 6
7 7 100 7 7
8 8 100 8 8
9 9 18 9 9
10 10 20 10 10
You will still have to copy (part of) your data to get it to the cores, though. But that happens anyway when you call the snowfall high-level functions on data frames, as snowfall uses the low-level functions of snow under the hood.
Plus, one shouldn't forget that if you change one value in a data frame, the whole data frame is copied in memory as well. So you won't win that much by adding the values one by one when they come back from the cluster. You might want to try some different approaches and do some memory profiling as well.
I agree with Joris that you will need to copy your data to the other cores.
On the positive side, you don't have to worry about NA's being in the data or not, within the cores.
If your original data.frame is called cont.values:
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)   # assuming a snowfall cluster has been set up, e.g. like this
nnaidx <- is.na(cont.values$value)  # where is missing data originally
dfrnna <- cont.values[nnaidx, ]     # subset for copying to other cores
calcValForDfrRow <- function(dfrRow){ return(dfrRow$x + dfrRow$y) }  # or whatever pleases you
sfExport(dfrnna, calcValForDfrRow)  # export what is needed to other cores
cont.values$value[nnaidx] <- sfSapply(seq(dim(dfrnna)[1]),
                                      function(i){ calcValForDfrRow(dfrnna[i, ]) })
# sfSapply handles 'reordering', so this works exactly as if you had called sapply
should work nicely (barring typos)
