I am trying to find the most efficient way to repeatedly search for combinations of two variables in a reference table. The problem arises from an implementation of a hill-climbing algorithm with an annealing step size, which adds a lot of complexity.
To explain, say I have two variables A and B that I want to optimise. I start with 100 combinations of these variables that I will iterate through:
set.seed(100)
A_start <- sample(1000,10,rep=F)
B_start <- sample(1000,10,rep=F)
A_B_starts<-expand.grid(A = A_start,
B = B_start)
head(A_B_starts)
A B
1 714 823
2 503 823
3 358 823
4 624 823
5 985 823
6 718 823
For each of these start combinations, I want to use their immediate neighbours in a predictive model and, if their error is less than that of the start combination, continue in that direction. This is repeated until a maximum number of iterations is hit or the error increases (standard hill climbing). I do not, however, want to re-check combinations I have already looked at, so I use a reference table to store checked combinations. Each time I generate the immediate neighbours, I check whether they are already in the reference table before running the predictive model; any that are present are simply removed. More complexity is added because I want the step size that generates the immediate neighbours to be annealing, i.e. to get smaller over time. I have implemented this using data.table:
max_iterations <- 1e+06
#Set max size so it is efficient to add new combinations; max size is 100 start points by max iterations allowed
ref <- data.table(A = numeric(),
                  B = numeric(),
                  key = c("A","B"))[1:(100*max_iterations)]
ref
A B
1: NA NA
2: NA NA
3: NA NA
4: NA NA
5: NA NA
---
99999996: NA NA
99999997: NA NA
99999998: NA NA
99999999: NA NA
100000000: NA NA
So here is the loop that actually works through the problem:
step_A <- 5
step_B <- 5
visited_counter <- 1L
for(start_i in 1:nrow(A_B_starts)){
  initial_error <- get.error.pred.model(A_B_starts[start_i,])
  A <- A_B_starts[start_i,1]
  B <- A_B_starts[start_i,2]
  #Add start i to checked combinations
  set(ref, i=visited_counter, j="A", value=A)
  set(ref, i=visited_counter, j="B", value=B)
  visited_counter <- visited_counter+1L
  iterations <- 1
  while(iterations<max_iterations){
    #Anneal step
    decay_A <- step_A / iterations
    decay_B <- step_B / iterations
    step_A <- step_A * 1/(1+decay_A*iterations)
    step_B <- step_B * 1/(1+decay_B*iterations)
    #Get neighbours to check
    to_visit_A <- c(A+step_A, A-step_A)
    to_visit_B <- c(B+step_B, B-step_B)
    to_visit <- setDT(expand.grid("A"=to_visit_A, "B"=to_visit_B),
                      key=c("A","B"))
    #Now check if any combinations have been checked before and remove them if so
    #The key must be reset inside the loop because values in ref have been updated
    setkey(ref,A,B)
    prev_visited <- ref[to_visit, nomatch=0L]
    to_visit <- to_visit[!prev_visited]
    #Run the model on the remaining combinations and, if the error is reduced, continue
    best_neighbour <- get.min.error.pred.model(to_visit)
    if(best_neighbour$error < initial_error){
      initial_error <- best_neighbour$error
      A <- best_neighbour$A
      B <- best_neighbour$B
    } else {
      iterations <- max_iterations
    }
    #Add all checked combinations to the reference table and update the iteration count
    for(visit_i in 1L:nrow(to_visit)){
      #This will reset the key of the data.table
      set(ref, i=visited_counter, j="A", value=to_visit$A[visit_i])
      set(ref, i=visited_counter, j="B", value=to_visit$B[visit_i])
      visited_counter <- visited_counter+1L
      iterations <- iterations+1
    }
  }
}
The problem with this approach is that I have to reset the key on each loop iteration, because new combinations have been added to ref, and this makes it very slow:
setkey(ref,A,B)
prev_visited<-ref[to_visit,nomatch=0L]
to_visit<-to_visit[!prev_visited]
Also, the reason I mention the annealing is that I had another idea: use a sparse matrix (the Matrix package) to hold indicators of pairs already checked, which would allow very quick checks:
library(Matrix)
#Use a sparse matrix for efficient search and optimal RAM usage
sparse_matrix <- sparseMatrix(i = 1:(100*1e+06),
                              j = 1:(100*1e+06))
However, since the step size is variable, i.e. A and B can hold any value at increasingly small intervals, I don't know how to initialise the sparse matrix so that it can capture all possible combinations that might be checked.
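For what it is worth, here is a minimal sketch of the indicator idea when A and B are plain integer positions in 1..1000 (an assumption; it does not address the fractional values produced once the step size shrinks):
library(Matrix)
visited <- Matrix(FALSE, nrow = 1000, ncol = 1000, sparse = TRUE)  #all FALSE, nothing visited yet
visited[714, 823] <- TRUE   #mark the combination A = 714, B = 823 as checked
visited[714, 823]           #TRUE, so this neighbour would be skipped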
(Not really an answer, but too long for a comment.)
If the number of possible solutions is huge, it might be impractical or impossible to store them all. What is more, the fastest way to look up a single solution is generally a hashtable; but setting up the hashtable is slow, so you might not gain much (your objective function needs to be more expensive than this set-up/look-up overhead). Depending on the problem, much of this solution-storing might be a waste; the algorithm may never revisit them. An alternative suggestion might be a first-in/first-out data structure, which simply stores the last N solutions that have been visited. (Even a linear look-up may be faster than working with a repeatedly set-up hash table for a short list.) But in any case, I'd start with some testing of whether, and how often, the algorithm actually revisits a particular solution.
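As a rough illustration of the last-N idea (purely a sketch, with made-up names, not code from the question), one could keep a fixed-size ring buffer of recently visited (A, B) pairs and run a linear membership check against it:
N <- 1000L
recent <- matrix(NA_real_, nrow = N, ncol = 2)  #ring buffer of the last N visited (A, B) pairs
pos <- 0L                                       #last slot written
remember <- function(a, b) {
  pos <<- (pos %% N) + 1L                       #wrap around, overwriting the oldest entry
  recent[pos, ] <<- c(a, b)
}
seen <- function(a, b) {
  any(recent[, 1] == a & recent[, 2] == b, na.rm = TRUE)
}
remember(714, 823)
seen(714, 823)   #TRUE
seen(503, 823)   #FALSE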
Related
I have two single vector data frames of unequal length
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
For each observation in aa, I want to count the number of instances where bb is less than aa.
My result:
bb<aa
1 1
2 7
3 9
I have been able to do it two ways by creating a function and using apply, but my datasets are large and I let one run all night without end.
What I have:
fun1<-function(a,b){k<-colSums(b<a)
k<-k*.000058242}
system.time(replicate(5000,data.frame(apply(aa,1,fun1,b=bb))))
user system elapsed
3.813 0.011 3.883
Secondly,
fun2<-function(a,b){k<-length(which(b<a))
k<-k*.000058242}
system.time(replicate(5000,data.frame(apply(aa,1,fun2,b=bb))))
user system elapsed
3.648 0.006 3.664
The second function is slightly faster in all my tests, but I let the first run all night on a dataset where bb has more than 1.7 million rows and aa more than 160,000, and it never finished.
I found this post and have tried using with(), but cannot seem to get it to work; I also tried a for loop without success.
Any help or direction is appreciated.
Thank you!
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
sapply(aa[[1]],function(x)sum(bb[[1]]<x))
# [1] 1 7 9
Some more realistic examples:
n <- 1.6e3
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 14.63 2.23 16.87
n <- 1.6e4
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 148.77 18.11 167.26
So with length(aa) = 1.6e4 this takes about 2.5 min (on my system), and the process scales as O(length(aa)) - no surprise there. Therefore, with your full dataset, it should run in about 25 min. Still kind of slow. Maybe someone else will come up with a better way.
In my original post I had been looking for the number of times bb < aa; I realised I can get at this with the empirical cumulative distribution function, ecdf().
So in my example
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
x<-ecdf(bb[,1])
x(2)
[1] 0.2
x(12)
[1] 0.7
x(35)
[1] 0.9
To get the answers in my original post I would need to multiply by the number of data points within bb, in this instance 10. The first value does not match because my original post asked for a strict bb < aa, whereas ecdf() counts bb <= aa.
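Continuing the example, multiplying the ecdf values by nrow(bb) recovers the counts (a quick check; the first value is 2 rather than 1 because of the <= versus < distinction just mentioned):
x(c(2, 12, 35)) * nrow(bb)
# [1] 2 7 9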
I am dealing with large datasets of land elevation and water elevation, over 1 million data points for each, but in the end I am creating an inundation curve. I want to know how much land will be inundated at water levels with a given exceedance probability.
So using the above ecdf() function on all 1 million data points would still be time consuming, but I realized I do not need all the data points just enough to create my curve.
So I applied the ecdf() function to the entire land data set, but then created an elevation sequence from the water data that is long enough to build the curve I need, yet small enough to be computed rapidly.
land_elevation <- data.frame(rnorm(1e6))
water_elevation<- data.frame(rnorm(1e6))
cdf_land<- ecdf(land_elevation[,1])
elevation_seq <- seq(from = min(water_elevation[,1]), to = max(water_elevation[,1]), length.out = 1000)
land <- sapply(elevation_seq, cdf_land)
My results are the same, but they are much faster.
I am new to R and am attempting to create a new dataframe of bootstrapped resamples of groups of different sizes. My dataframe has 6 variables and a group designation, and there are 128 groups of different Ns. Here is an example of my data:
head(PhenoM2)
ID Name PhenoNames Group HML RML FML TML FHD BIB
1 378607 PaleoAleut PaleoAleut 1 323.5 248.75 434.50 355.75 46.84 NA
2 378664 PaleoAleut PaleoAleut 1 NA 238.50 441.50 353.00 45.83 277.0
3 378377 PaleoAleut PaleoAleut 1 309.5 227.75 419.00 332.25 46.39 284.0
4 378463 PaleoAleut PaleoAleut 1 283.5 228.75 397.75 331.00 44.37 255.5
5 378602 PaleoAleut PaleoAleut 1 279.5 230.00 393.00 329.50 45.93 265.0
6 378610 PaleoAleut PaleoAleut 1 307.5 234.25 419.50 338.50 43.98 271.5
Pulling from this question - bootstrap resampling for hierarchical/multilevel data - and taking some advice from others (thanks!) I wrote the code:
resample.M <- NULL
for(i in 1000){
groups <- unique(PhenoM2$"Group")
for(ii in 1:128)
data.i.ii <- PhenoM2[PhenoM2$"Group"==groups[ii],]
resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii),replace=T),]
}
Unfortunately, this gives me the warning:
In resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T),:
number of items to replace is not a multiple of replacement length
Which I understand, since each of the 128 groups has a different N and none of it is a multiple of 1000. I put in resample.M[i] to try and accumulate all of the 1000x resamples of the 128 groups into a single database, and I'm pretty sure the problem is here.
Nearly all of the examples of for loops I've read create a vector first - numeric(1000) - and then plug in the information, but since I want to keep all of the data (which includes factors, integers, and numerics) this doesn't work. I tried making a matrix to put the info in (there are 2187 unique individuals in the dataframe):
resample.M <- matrix(ncol=2187000,nrow=10)
But it's giving me the same warning.
So, since I'm sure I'm missing something basic here, I have three questions:
How can I get this code to resample all of the groups (with replacement and based on their individual Ns)?
How can I get this code to repeat this resampling 1000x?
How can I get the resamples of every group into the same database?
Thank you so much for your insight and expertise!
I think you may have wanted to use double square bracket, to store the results in a list, i.e. resample.M[[i]] <- .... Apart from that it makes more sense to write PhenoM2$Group than PhenoM2$"Group" and also groups <- unique(PhenoM2$Group) can go outside of your for loop since you only need to compute it once. Also replace 1:128 by 1:length(groups) or seq_along(groups), so that you don't need to hard code the length of the vector.
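Putting those fixes together, a corrected version of your loop might look roughly like this (just a sketch of the advice above, still using plain loops and base subsetting):
groups <- unique(PhenoM2$Group)       #computed once, outside the loops
resample.M <- vector("list", 1000)    #one list element per bootstrap replicate
for (i in 1:1000) {
  resampled.groups <- vector("list", length(groups))
  for (ii in seq_along(groups)) {
    data.i.ii <- PhenoM2[PhenoM2$Group == groups[ii], ]
    resampled.groups[[ii]] <- data.i.ii[sample(1:nrow(data.i.ii), replace = TRUE), ]
  }
  resample.M[[i]] <- do.call(rbind, resampled.groups)  #one data frame per replicate
}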
Because you will often need to operate on data frames grouped by some variable, I suggest you familiarise yourself with a package designed to do that, rather than using for loops, which can be very slow. The best one for a beginner in R may be plyr, which has an easy syntax (although there are many possibilities, including the slightly more "advanced" packages like dplyr and data.table).
So for a subset d <- subset(PhenoM2, Group == 1), you already have the function you need to perform on it: function(d) d[sample(1:nrow(d), replace = TRUE),].
Now to go over all such subsets, perform this operation and then arrange the results in a new data frame named samples you do
samples <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
So what remains is to iterate this 1000 or however many times you want. You can use a for loop for this, storing the results in a list. Note that you need to use double square bracket [[ to set elements of the list.
n <- 1000 # number of iterations
samples <- vector("list", n) # list of length n to store results
for (i in seq_along(samples))
samples[[i]] <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
An alternative way would be to use the function replicate, that performs the same task many times.
Once you have done this, all resamples will be stored in a list. I am not sure what you mean by "How can I get the resamples of every group into the same database". If you want to group them in a single data frame, you do all.samples <- do.call(rbind, samples). In general, you can format your list of samples using do.call and lapply together with a function.
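For example, to tag each resample with its replicate number before stacking everything into one data frame (a small sketch):
all.samples <- do.call(rbind,
                       lapply(seq_along(samples),
                              function(i) cbind(samples[[i]], replicate = i)))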
I'm running R version 3.0.2 in RStudio and Excel 2011 for Mac OS X. I'm performing a quantile normalization between 4 sets of 45,015 values. Yes I do know about the bioconductor package, but my question is a lot more general. It could be any other computation. The thing is, when I perform the computation (1) "by hand" in Excel and (2) with a program I wrote from scratch in R, I get highly similar, yet not identical results. Typically, the values obtained with (1) and (2) would differ by less than 1.0%, although sometimes more.
Where is this variation likely to come from, and what should I be aware of concerning number approximations in R and/or Excel? Does this come from a lack of float accuracy in either one of these programs? How can I avoid this?
[EDIT]
As was suggested to me in the comments, this may be case-specific. To provide some context, I described methods (1) and (2) below in detail using test data with 9 rows. The four data sets are called A, B, C, D.
[POST-EDIT COMMENT]
When I perform this on a very small data set (test sample: 9 rows), the results in R and Excel do not differ. But when I apply the same code to the real data (45,015 rows), I get slight variation between R and Excel. I have no clue why that may be.
(2) R code:
dataframe A
Aindex A
1 2.1675e+05
2 9.2225e+03
3 2.7925e+01
4 7.5775e+02
5 8.0375e+00
6 1.3000e+03
7 8.0575e+00
8 1.5700e+02
9 8.1275e+01
dataframe B
Bindex B
1 215250.000
2 10090.000
3 17.125
4 750.500
5 8.605
6 1260.000
7 7.520
8 190.250
9 67.350
dataframe C
Cindex C
1 2.0650e+05
2 9.5625e+03
3 2.1850e+01
4 1.2083e+02
5 9.7400e+00
6 1.3675e+03
7 9.9325e+00
8 1.9675e+02
9 7.4175e+01
dataframe D
Dindex D
1 207500.0000
2 9927.5000
3 16.1250
4 820.2500
5 10.3025
6 1400.0000
7 120.0100
8 175.2500
9 76.8250
Code:
#re-order by ascending values
A <- A[order(A$A),, drop=FALSE]
B <- B[order(B$B),, drop=FALSE]
C <- C[order(C$C),, drop=FALSE]
D <- D[order(D$D),, drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL
#compute average
qnorm <- data.frame(cbind(A$A,B$B,C$C,D$D))
colnames(qnorm) <- c("A","B","C","D")
qnorm$qnorm <- (qnorm$A+qnorm$B+qnorm$C+qnorm$D)/4
#replace original values by average values
A$A <- qnorm$qnorm
B$B <- qnorm$qnorm
C$C <- qnorm$qnorm
D$D <- qnorm$qnorm
#re-order by index number
A <- A[order(A$Aindex),,drop=FALSE]
B <- B[order(B$Bindex),,drop=FALSE]
C <- C[order(C$Cindex),,drop=FALSE]
D <- D[order(D$Dindex),,drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL
(1) Excel:
1. Assign index numbers to each set.
2. Re-order each set in ascending order: select the columns two by two and use Custom Sort... by A, B, C, or D.
3. Calculate the average over columns A, B, C, and D (with =AVERAGE()).
4. Replace the values in columns A, B, C, and D by those in the average column using Paste Special... > Values.
5. Re-order everything according to the original index numbers.
If you use exactly the same algorithm you will get exactly the same results - not within 1%, but to the 10th decimal. So you're not using the same algorithms, and the details probably won't change this general answer.
(Or it could be a bug in Excel or R, but this is less likely.)
Answering my own question!
It ended up being Excel's fault (well, kind of): at some point, either in the conversion from the original TAB-delimited file to CSV, or later on when I started copying and pasting stuff, the values got rounded up.
The original TAB-delimited files had 6 decimals, whereas the CSV files only had 2. I had been doing the analysis so far with quantile normalization done in Excel from the 6-decimal data, whereas I read the data from the CSV files for my quantile normalization function in R, hence the change.
For the above examples for R and Excel respectively, I used data coming from the same source, which is why I got the same results.
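As a rough illustration of the size of error such truncation can introduce (a made-up value of the same magnitude as the small entries above, not one of the actual data points):
a <- 8.037512          #a value with 6 decimals, as in the original TAB-delimited file
b <- round(a, 2)       #what remains after the 2-decimal CSV conversion
100 * abs(b - a) / a   #relative difference in percent, roughly 0.03% here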
What would you suggest would be best now that I figured this out:
1/Change the title to let other clueless people know that this kind of thing can happen?
2/Consider this post useless and delete it?
I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table for each column of the frame, collect these in a list, and then use the duplicated() function to find elements of the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
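One thing to be aware of (an illustrative example, not from the original answer): columns that merely share the same summary are also flagged, even when their values differ row by row.
d <- data.frame(a = 1:10, b = 10:1)   #different columns, identical summaries
duplicated(lapply(d, summary))        #FALSE TRUE: b is treated as a duplicate of a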
Not 100% sure, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
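To actually drop the duplicated columns with this trick (note that t() coerces a mixed-type data frame to a character matrix, which is fine for an equality check):
testframe[, !duplicated(t(testframe))]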
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
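For instance, a sketch of the "take the mean" option, assuming the columns sharing a name are numeric:
dup_names <- unique(names(df)[duplicated(names(df))])
for (nm in dup_names) {
  idx <- which(names(df) == nm)       #all positions of this column name
  df[[idx[1]]] <- rowMeans(df[idx])   #keep the first copy, filled with the row means
  df <- df[-idx[-1]]                  #drop the remaining copies
}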
What about just:
unique.matrix(testframe, MARGIN=2)
Actually, you would just need to invert the duplicated() result in your code and could stick to using subset (which is more readable than bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result, but the answer is a bit slow for a large matrix, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package (see also dataPreparation::which_are_bijection):
which_are_in_double(testframe)
This returns the duplicated columns in your example (height2 and gender2).
Build a data set with the wanted dimensions for the performance tests:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is between 1.5 and 3 times faster than the other proposed solutions.
NB 1: I excluded the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] from the benchmark because it was already 10 times slower with 12k rows.
NB 2: Please note that, the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_bijection would perform much better, while the other methods would perform similarly.
I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
cell value x y
1 11117 NA -34 322
2 11118 NA -30 322
3 11119 NA -26 322
4 11120 NA -22 322
5 11121 NA -18 322
6 11122 NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My following solution tries to avoid that. Each cell can be calculated independently, screaming for parallel execution.
What my function actually does is check if there's a value for a specified cell number and if it's NA, it calculates it and inserts it in place of NA.
I can run my magic function (result is value for a corresponding cell) using apply family of functions and from within apply, I can read and write cont.values without a problem (it's in global environment).
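For reference, the serial version of that pattern might look roughly like this (cell.magic is only a stand-in for the real, slow function):
cell.magic <- function(x, y) { Sys.sleep(1); x + y }  #placeholder for the expensive calculation
fill.cell <- function(cell) {
  i <- which(cont.values$cell == cell)
  if (is.na(cont.values$value[i])) {
    cont.values$value[i] <<- cell.magic(cont.values$x[i], cont.values$y[i])
  }
  cont.values$value[i]
}
#invisible(sapply(cont.values$cell, fill.cell))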
Now, I want to run this in parallel (using snowfall) and I'm unable to read or write from/to this variable from individual core.
Question: What solution would be able to read from and write to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach to doing this?
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see if the value has been calculated (redisGet) and if not do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (sleeps for a second)
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that
memoize <-
function(FUN)
{
force(FUN) # circumvent lazy evaluation
require(rredis)
redisConnect()
function(x)
{
key <- as.character(x)
val <- redisGet(key)
if (is.null(val)) {
val <- FUN(x)
redisSet(key, val)
}
val
}
}
We then memoize our function
funmem <- memoize(fun)
and go
> system.time(res <- funmem(10)); res
user system elapsed
0.003 0.000 1.082
[1] 10
> system.time(res <- funmem(10)); res
user system elapsed
0.001 0.001 0.040
[1] 10
This does require a redis server running outside R but very easy to install; see the documentation that comes with the rredis package.
A within-R parallel version might be
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
It will depend on what the function in question is, of course, but I'm afraid snowfall won't be much of a help there. The thing is, you'll have to export the whole dataframe to the different cores (see ?sfExport) and still find a way to combine it. That kind of defeats the whole purpose of changing the value in the global environment, as you probably want to keep memory use as low as possible.
You can dive into the low-level functions of snow to (kind of) get this to work. See the following example:
#Some data
Data <- data.frame(
cell = 1:10,
value = sample(c(100,NA),10,TRUE),
x = 1:10,
y = 1:10
)
# A sample function
sample.func <- function(){
id <- which(is.na(Data$value)) # get the NA values
# this splits up the values from the dataframe in a list
# which will be passed to clusterApply later on.
parts <- lapply(clusterSplit(cl,id),function(i)Data[i,c("x","y")])
# Here happens the magic
Data$value[id] <<-
unlist(clusterApply(cl,parts,function(x){
x$x+x$y
}
))
}
#now we run it
require(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
sample.func()
stopCluster(cl)
> Data
cell value x y
1 1 100 1 1
2 2 100 2 2
3 3 6 3 3
4 4 8 4 4
5 5 10 5 5
6 6 12 6 6
7 7 100 7 7
8 8 100 8 8
9 9 18 9 9
10 10 20 10 10
You will still have to copy (part of) your data to get it to the cores, though. But that will happen anyway when you call snowfall's high-level functions on dataframes, as snowfall uses the low-level functions of snow anyway.
Plus, one shouldn't forget that if you change one value in a dataframe, the whole dataframe is copied in memory as well. So you won't win that much by adding the values one by one when they come back from the cluster. You might want to try some different approaches and do some memory profiling as well.
I agree with Joris that you will need to copy your data to the other cores.
On the positive side, you don't have to worry about NA's being in the data or not, within the cores.
If your original data.frame is called cont.values:
nnaidx<-is.na(cont.values$value) #where is missing data originally
dfrnna<-cont.values[nnaidx,] #subset for copying to other cores
calcValForDfrRow<-function(dfrRow){return(dfrRow$x+dfrRow$y)}#or whatever pleases you
sfExport("dfrnna", "calcValForDfrRow") #export what is needed to other cores (sfExport takes object names as strings)
cont.values$value[nnaidx]<-sfSapply(seq(dim(dfrnna)[1]), function(i){calcValForDfrRow(dfrnna[i,])}) #sfSapply handles 'reordering', so works exactly as if you had called sapply
should work nicely (barring typos)