I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
cell value x y
1 11117 NA -34 322
2 11118 NA -30 322
3 11119 NA -26 322
4 11120 NA -22 322
5 11121 NA -18 322
6 11122 NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My solution below tries to avoid that. Each cell can be calculated independently, which screams for parallel execution.
What my function actually does is check whether there's a value for a specified cell number and, if it's NA, calculate it and insert it in place of the NA.
I can run my magic function (the result is the value for the corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
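A minimal sketch of that sequential version, assuming a hypothetical magic(x, y) standing in for my slow per-cell calculation:
fill.missing <- function() {
    idx <- which(is.na(cont.values$value))        # only cells still missing a value
    cont.values$value[idx] <<- sapply(idx, function(i) {
        magic(cont.values$x[i], cont.values$y[i]) # magic() is the hypothetical slow function
    })
}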
Now I want to run this in parallel (using snowfall), and I'm unable to read from or write to this variable from an individual core.
Question: What solution would be able to read/write from/to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach for doing this?
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see if the value has been calculated (redisGet) and, if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm.
Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second):
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that:
memoize <- function(FUN) {
    force(FUN)       # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x) {
        key <- as.character(x)
        val <- redisGet(key)
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)
        }
        val
    }
}
We then memoize our function
funmem <- memoize(fun)
and go
> system.time(res <- funmem(10)); res
user system elapsed
0.003 0.000 1.082
[1] 10
> system.time(res <- funmem(10)); res
user system elapsed
0.001 0.001 0.040
[1] 10
This does require a Redis server running outside of R, but it is very easy to install; see the documentation that comes with the rredis package.
A within-R parallel version might be
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
It will depend on what the function in question is, of course, but I'm afraid snowfall won't be much help there. The thing is, you'll have to export the whole data frame to the different cores (see ?sfExport) and still find a way to combine the results afterwards. That rather defeats the purpose of changing the value in the global environment, as you probably want to keep memory use as low as possible.
You can dive into the low-level functions of snow to, kind of, get this to work. See the following example:
# Some data
Data <- data.frame(
    cell  = 1:10,
    value = sample(c(100, NA), 10, TRUE),
    x     = 1:10,
    y     = 1:10
)
# A sample function
sample.func <- function(){
    id <- which(is.na(Data$value))   # get the NA values
    # this splits up the values from the data frame in a list
    # which will be passed to clusterApply later on
    parts <- lapply(clusterSplit(cl, id), function(i) Data[i, c("x", "y")])
    # Here happens the magic
    Data$value[id] <<- unlist(clusterApply(cl, parts, function(x) {
        x$x + x$y
    }))
}
#now we run it
require(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
sample.func()
stopCluster(cl)
> Data
cell value x y
1 1 100 1 1
2 2 100 2 2
3 3 6 3 3
4 4 8 4 4
5 5 10 5 5
6 6 12 6 6
7 7 100 7 7
8 8 100 8 8
9 9 18 9 9
10 10 20 10 10
You will still have to copy (part of) your data to get it to the cores. But that happens anyway when you call snowfall's high-level functions on data frames, as snowfall uses the low-level functions of snow under the hood.
Plus, one shouldn't forget that if you change one value in a data frame, the whole data frame is copied in memory as well. So you won't win that much by adding the values one by one as they come back from the cluster. You might want to try some different approaches and do some memory profiling as well.
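As a quick illustration of that copy-on-modify behaviour (a base-R sketch, not specific to snowfall):
df <- data.frame(a = 1:5, b = 6:10)
tracemem(df)      # report whenever df gets duplicated
df$a[1] <- 0L     # changing a single value still duplicates the whole data frame
untracemem(df)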
I agree with Joris that you will need to copy your data to the other cores.
On the positive side, you don't have to worry about NA's being in the data or not, within the cores.
If your original data.frame is called cont.values:
nnaidx <- is.na(cont.values$value)    # where is the data missing originally
dfrnna <- cont.values[nnaidx, ]       # subset for copying to the other cores
calcValForDfrRow <- function(dfrRow) { dfrRow$x + dfrRow$y }   # or whatever pleases you
sfExport("dfrnna", "calcValForDfrRow")   # export what is needed to the other cores
cont.values$value[nnaidx] <- sfSapply(seq_len(nrow(dfrnna)),
                                      function(i) calcValForDfrRow(dfrnna[i, ]))
# sfSapply handles the 'reordering', so this works exactly as if you had called sapply
should work nicely (barring typos)
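For completeness, a sketch of the snowfall session that would wrap the snippet above (the number of CPUs is just an example):
library(snowfall)
sfInit(parallel = TRUE, cpus = 2)   # start the worker processes
# ... the sfExport()/sfSapply() lines from above go here ...
sfStop()                            # shut the workers down again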
Related
I have a data frame with multiple IDs and I am trying to perform feature extraction on the different ID sets. The data looks like this:
id x y
1 3812 60 7
2 3812 63 105
3 3812 65 1000
4 3812 69 8
5 3812 75 88
6 3812 78 13
where id takes on about 200 different values. So I am trying to extract features from the (x,y) data, and I'd like to do it in parallel, since for some datasets, doing it sequentially can take about 20 minutes or so. Right now I am using dplyr as such:
x = d %>% group_by(id) %>% do(data.frame(getFeatures(., func_args)))
where func_args is just a set of additional inputs to the function getFeatures. I am trying to use plyr::ldply with parallel=TRUE to do this, but there is a problem: within getFeatures, I am using another function that I've written. So, when I try to run in parallel, I get an error:
Error in do.ply(i) :
task 1 failed - "could not find function "desparsify""
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
Error in do.ply(i) :
task 1 failed - "could not find function "desparsify""
where desparsify is a custom function written to process the (x,y) data (it effectively adds zeros to x locations that are not present in the dataset). I get a similar error when I try to use the cosine function from package lsa. Is there a way to use parallel processing when calling external/non-base functions in R?
You don't show how you set up plyr to parallelize, but I think I can guess what you're doing. I also guess you're on Windows. Here's a teeny standalone example illustrating what's going on:
library(plyr)
## On Windows, doParallel::registerDoParallel(2) becomes:
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
## Error in do.ply(i) :
## task 1 failed - "could not find function "desparsify""
If you use doFuture instead of doParallel, the underlying future framework will make sure 'desparsify' is found, e.g.
library(plyr)
doFuture::registerDoFuture()
future::plan("multisession", workers = 2)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
str(y)
## List of 3
## $ : num 1
## $ : num 1.41
## $ : num 1.73
(disclaimer: I'm the author of the future framework)
PS. Note that plyr is a legacy package no longer maintained. You might want to look into future.apply, furrr, or foreach with doFuture as alternatives for parallelization.
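For instance, a sketch of the same grouped computation with future.apply (d, getFeatures() and func_args are the objects from the question, assumed to exist):
library(future.apply)
future::plan("multisession", workers = 2)
# split the data frame by id and process the pieces in parallel;
# globals such as getFeatures() and desparsify() are picked up automatically
res <- future_lapply(split(d, d$id),
                     function(piece) getFeatures(piece, func_args))
x <- do.call(rbind, res)   # assuming getFeatures() returns a data frame per id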
There is. Take a look at the parApply family of functions; I usually use parLapply.
You'll need to set up a cluster with cl <- makeCluster(<number of cores>) and pass it, together with a vector of your ids (this may depend on how your function identifies the entries for each id) and your function, to parLapply to produce a list with the output of your function applied to each group in parallel.
library(parallel)
cl <- makeCluster(4)    # use your number of cores here
ids <- 1:10
# in case you need to export variables/functions to the workers
clusterExport(cl = cl, varlist = c("variable name", "function name"))
result <- parLapply(cl = cl, ids, your_function)   # your_function is a placeholder
stopCluster(cl)
I am trying to find the most efficient way to repeatedly search for combinations of two variables in a reference table. The problem is based on an implementation of a hill climbing algorithm with annealing step size and, as such, adds a lot of complexity to the problem.
To explain, say I have two variables A and B that I want to optimise. I start with 100 combinations of these variables that I will iterate through
set.seed(100)
A_start <- sample(1000,10,rep=F)
B_start <- sample(1000,10,rep=F)
A_B_starts<-expand.grid(A = A_start,
B = B_start)
head(A_B_starts)
A B
1 714 823
2 503 823
3 358 823
4 624 823
5 985 823
6 718 823
For each of these start combinations, I want to use their immediate neighbours in a predictive model and, if a neighbour's error is less than that of the start combination, continue in that direction. This is repeated until a maximum number of iterations is hit or the error increases (standard hill climbing). However, I do not want to recheck combinations I have already looked at, so I want to use a reference table to store checked combinations. Each time I generate the immediate neighbours, I check whether they are already in the reference table before running the predictive model; any that are present are simply removed. More complexity is added because I want the step size that generates the immediate neighbours to anneal, i.e. get smaller over time. I have implemented this using data.table:
max_iterations <- 1e+06
# Set the maximum size up front so it is efficient to add new combinations;
# the maximum size is 100 start points times the maximum iterations allowed
ref <- data.table(A = numeric(),
                  B = numeric(),
                  key = c("A", "B"))[1:(100*max_iterations)]
ref
A B
1: NA NA
2: NA NA
3: NA NA
4: NA NA
5: NA NA
---
99999996: NA NA
99999997: NA NA
99999998: NA NA
99999999: NA NA
100000000: NA NA
So the loop to actually go through the problem
step_A <- 5
step_B <- 5
visited_counter <- 1L
for (start_i in 1:nrow(A_B_starts)) {
    initial_error <- get.error.pred.model(A_B_starts[start_i, ])
    A <- A_B_starts[start_i, 1]
    B <- A_B_starts[start_i, 2]
    # Add start point i to the checked combinations
    set(ref, i = visited_counter, j = "A", value = A)
    set(ref, i = visited_counter, j = "B", value = B)
    visited_counter <- visited_counter + 1L
    iterations <- 1
    while (iterations < max_iterations) {
        # Anneal the step
        decay_A <- step_A / iterations
        decay_B <- step_B / iterations
        step_A <- step_A * 1/(1 + decay_A * iterations)
        step_B <- step_B * 1/(1 + decay_B * iterations)
        # Get the neighbours to check
        to_visit_A <- c(A + step_A, A - step_A)
        to_visit_B <- c(B + step_B, B - step_B)
        to_visit <- setDT(expand.grid("A" = to_visit_A, "B" = to_visit_B),
                          key = c("A", "B"))
        # Now check if any combinations have been checked before and remove them if so.
        # Set the key for efficient searching; it needs to be reset inside the loop
        # because values in the data.table are being updated
        setkey(ref, A, B)
        prev_visited <- ref[to_visit, nomatch = 0L]
        to_visit <- to_visit[!prev_visited]
        # Run the model on the remaining combinations and continue if the error is reduced
        best_neighbour <- get.min.error.pred.model(to_visit)
        if (best_neighbour$error < initial_error) {
            initial_error <- best_neighbour$error
            A <- best_neighbour$A
            B <- best_neighbour$B
        } else {
            iterations <- max_iterations
        }
        # Add everything that was checked to the reference and update the number of iterations
        for (visit_i in 1L:nrow(to_visit)) {
            # This will reset the key of the data.table
            set(ref, i = visited_counter, j = "A", value = to_visit[visit_i, 1])
            set(ref, i = visited_counter, j = "B", value = to_visit[visit_i, 2])
            visited_counter <- visited_counter + 1L
            iterations <- iterations + 1
        }
    }
}
The problem with this approach is that I have to reset the key on each loop iteration, because a new combination has been added to ref, and that makes it very slow:
setkey(ref,A,B)
prev_visited<-ref[to_visit,nomatch=0L]
to_visit<-to_visit[!prev_visited]
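For comparison, the same membership check can be written as an on-the-fly anti-join with data.table's on= argument, which does not require ref to be keyed (a sketch):
# keep only the neighbours that are not already present in ref
to_visit <- to_visit[!ref, on = c("A", "B")]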
Also, the reason I mention the annealing is that I had another idea: use a sparse matrix (from the Matrix package) to hold indicators of the pairs already checked, which would allow very quick checks:
require(Matrix)
# Use a sparse matrix for an efficient search and optimal RAM usage;
# i and j would be the row/column indices of the pairs already checked
sparse_matrix <- sparseMatrix(i = 1:(100*1e+06),
                              j = 1:(100*1e+06))
However, since the step size is variable i.e. A/B can hold any value with increasingly small intervals, I don't know how to initialise appropriate values of A and B in the sparse matrix to capture all possible combinations checked?
(Not really an answer, but too long for a comment.)
If the number of possible solutions is huge, it might be impractical or impossible to store them all. What is more, the fastest way to look up a single solution is generally a hash table; but setting up the hash table is slow, so you might not gain much (your objective function needs to be more expensive than this set-up/look-up overhead). Depending on the problem, much of this storing of solutions might be a waste; the algorithm may never revisit them. An alternative suggestion might be a first-in/first-out data structure, which simply stores the last N solutions that have been visited. (Even a linear look-up may be faster than working with a repeatedly set-up hash table for a short list.) But in any case, I'd start by testing whether, and how often, the algorithm actually revisits a particular solution.
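To make the hash-table idea concrete, a minimal sketch using an R environment (environments are hashed), keyed on the A/B pair as a string:
visited <- new.env(hash = TRUE)
# returns TRUE if the A/B pair was seen before, and records it otherwise
seen_before <- function(A, B) {
    key <- paste(A, B, sep = "_")
    if (exists(key, envir = visited, inherits = FALSE)) return(TRUE)
    assign(key, TRUE, envir = visited)
    FALSE
}
seen_before(714, 823)   # FALSE on the first visit
seen_before(714, 823)   # TRUE on any revisit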
Let's assume that we have a data frame x which contains the columns job and income. Referring to the data in the frame normally requires the commands x$job for the data in the job column and x$income for the data in the income column.
However, using the command attach(x) lets you do away with the name of the data frame and the $ symbol when referring to the same data. Consequently, x$job becomes job and x$income becomes income in the R code.
The problem is that many experts in R advise NOT to use the attach() command when coding in R.
What is the main reason for that? What should be used instead?
When to use it:
I use attach() when I want the environment you get in most stats packages (e.g. Stata, SPSS) of working with one rectangular dataset at a time.
When not to use it:
However, it gets very messy and code quickly becomes unreadable when you have several different datasets, particularly if you are in effect using R as a crude relational database, where different rectangles of data (all relevant to the problem at hand, and perhaps matched against each other in various ways) have variables with the same name.
The with() function, or the data= argument to many functions, is an excellent alternative in many of the instances where attach() is tempting.
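For example, using the built-in cars data set (the same one used in the examples below):
# instead of attach(cars); plot(speed, dist); detach(cars)
with(cars, plot(speed, dist))
# many modelling functions take the data frame directly via data=
lm(dist ~ speed, data = cars)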
Another reason not to use attach: it allows access to the values of columns of a data frame for reading (access) only, and as they were when attached. It is not a shorthand for the current value of that column. Two examples:
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> attach(cars)
> # convert stopping distance to meters
> dist <- 0.3048 * dist
> # convert speed to meters per second
> speed <- 0.44707 * speed
> # compute a meaningless time
> time <- dist / speed
> # check our work
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
No changes were made to the cars data set even though dist and speed were assigned to.
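The assignments did, however, create new objects in the global environment; those copies, not the columns of cars, are what received the converted values:
ls()                          # now includes "dist", "speed" and "time"
identical(dist, cars$dist)    # FALSE: the global copies hold the rescaled values
# clean up before the next example
rm(dist, speed, time)
detach(cars)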
If explicitly assigned back to the data set...
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> attach(cars)
> # convert stopping distance to meters
> cars$dist <- 0.3048 * dist
> # convert speed to meters per second
> cars$speed <- 0.44707 * speed
> # compute a meaningless time
> cars$time <- dist / speed
> # compute meaningless time being explicit about using values in cars
> cars$time2 <- cars$dist / cars$speed
> # check our work
> head(cars)
speed dist time time2
1 1.78828 0.6096 0.5000000 0.3408862
2 1.78828 3.0480 2.5000000 1.7044311
3 3.12949 1.2192 0.5714286 0.3895842
4 3.12949 6.7056 3.1428571 2.1427133
5 3.57656 4.8768 2.0000000 1.3635449
6 4.02363 3.0480 1.1111111 0.7575249
the dist and speed that are referenced in computing time are the original (untransformed) values; the values of cars$dist and cars$speed when cars was attached.
I think there's nothing wrong with using attach. I myself don't use it (then again, I love animals, but don't keep any, either). When I think of attach, I think long term. Sure, when I'm working with a script I know it inside and out. But in one week's time, a month or a year, when I go back to the script, I find the overhead of searching for where a certain variable comes from just too expensive. A lot of methods have a data argument, which makes calling variables pretty easy (sensu lm(x ~ y + z, data = mydata)). If not, I find the use of with() to my satisfaction.
In short, in my book, attach is fine for short, quick data exploration, but for developing scripts that I or others might want to use, I try to keep my code as readable (and transferable) as possible.
If you execute attach(data) multiple times, e.g. 5 times, then you can see (with the help of search()) that your data has been attached 5 times in the workspace environment. So if you detach it once (detach(data)), the data will still be present in the environment 4 more times. Hence, with()/within() are better options: they create a local environment containing the object, and you can use it without creating any confusion.
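For instance, the cars conversion from the earlier answer can be written without attach() at all:
# within() evaluates the assignments in a temporary environment built from
# the data frame and returns a modified copy; cars itself stays untouched
cars2 <- within(cars, {
    dist  <- 0.3048 * dist      # metres
    speed <- 0.44707 * speed    # metres per second
    time  <- dist / speed       # uses the converted values above
})
head(cars2)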