I'm working with two lists of data.frames and currently run something similar to this (simplified version of what I'm doing):
df1 <- data.frame("a","a1","L","R","b","c",1,2,3,4)
df2 <- data.frame("a","a1","L","R","b","c",4,4,4,4,4,44)
df3 <- data.frame(7,7,7,7)
df4 <- data.frame(5,5,5,5,9,9)
L1 <- list(df1,df2)
L2 <- list(df3,df4)
myfun <- function(x, y) {
  difa <- rowSums(abs(x[c(T,F)] - x[c(F,T)]))
  difb <- sum(abs(as.numeric(y[-c(1:6)])[c(T,F)] - as.numeric(y[-c(1:6)])[c(F,T)]))
  diff <- difa + difb
  return(diff)
}
output1 <- mapply(myfun, x = L2, y = L1)
Each list contains the same number of data frames, and each data frame in one list corresponds to a data frame in the other. The data frames in one list contain a single row, while those in the second list contain a dynamic number of rows; hence the use of sum and rowSums. The number of numeric columns is also dynamic, but it is always the same between corresponding data frames.
I'm looking to use parallel processing to speed up the computation when dealing with 1-10 million dataframes per list. I tried the following:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
clusterMap(cl, myfun, x = L2, y = L1)
stopCluster(cl)
However, due to the significant amount of data I'm using, this quickly fills up the memory. I assume that is because the entire lists of data frames are loaded into each worker? I'm new to parallel processing in R and have read that, for parallel functions that don't do it automatically, you have to split the data into chunks according to the number of available cores, so I tried the following, which does not work:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
output1 <- clusterMap(cl, myfun, x = split(L2, ceiling(seq_along(L2)/no_cores)), y = split(L1, ceiling(seq_along(L1)/no_cores)))
stopCluster(cl)
Can someone help a newbie out? Most of the information I've been reading uses parApply/parLapply/etc. I was able to get mcmapply working, but since it relies on forking I cannot use it: my code has to run on both unix and windows systems, hence the test on OS.type to choose the cluster type.
UPDATE: I think it is now working in the sense that it hands chunks out to the different workers, but the data type does not play nicely with the binary operators inside the workers. The issue appears to be that splitting turns the input into a list of lists of data frames, which is treated as non-numeric inside myfun.
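For what it's worth, a minimal sketch of one way around the list-of-lists issue (my own untested assumption, not a confirmed fix): keep clusterMap over the chunks, but have each worker run mapply over the data frames inside its chunk, so that myfun still receives plain data frames. splitIndices() from parallel builds one chunk per core:
library(parallel)
# Chunk-level wrapper: here x and y are lists of data frames,
# so mapply applies myfun to each pair inside the chunk on the worker
chunkfun <- function(x, y) mapply(myfun, x = x, y = y)
no_cores <- max(1, detectCores() - 1)
ptype <- if (.Platform$OS.type == "unix") "FORK" else "PSOCK"
cl <- makeCluster(no_cores, type = ptype)
if (ptype == "PSOCK") clusterExport(cl, "myfun")  # non-fork workers need myfun exported
idx <- splitIndices(length(L2), no_cores)         # one chunk of indices per core
out <- clusterMap(cl, chunkfun,
                  x = lapply(idx, function(i) L2[i]),
                  y = lapply(idx, function(i) L1[i]))
stopCluster(cl)
output1 <- unlist(out, use.names = FALSE)
Note that chunking by itself mainly cuts down the number of scheduling round trips; with a PSOCK cluster each worker still receives a full copy of its chunk, so memory can remain the limiting factor.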
Related
I am trying to run a spline function by rows (690,075 of them) of a dataframe (camera1) with 4096 columns (each column represents a position on the x axis), where the input variable to the function is a column of another dataset of the same length (test$vr), using a for loop; but I am having serious computational time issues.
I have tried converting the dataframe to a matrix and storing the output in a list, among other things, but to no avail. I have to do this for two other dataframes (camera2, camera3) of the same size.
Code
# Note camera1 and test$vr are of the same length
# Initialize
library(dplyr)  # for bind_rows()
final.data1 <- data.frame()
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096),3)
system.time({
for (i in 1:690075) {
  w1 <- as.numeric(colnames(camera1[-1])) * (1.0 + test$vr[i]/299792.458)
  my.data1 <- as.data.frame(t(splinefun(x = w1, y = camera1[i,][-1])(y1)))
  colnames(my.data1) <- y1
  final.data1 <- bind_rows(final.data1, my.data1)
} })
Running on an Ubuntu box with 344GB RAM and a 30-core Intel(R) Xeon(R) CPU E5-2695 @ 2.30GHz.
Any suggestions would be greatly appreciated.
Thank you.
Without seeing the data it's not easy to optimize your code, but I would start with something along the lines of the following.
final.data1 <- matrix(nrow = 690075, ncol = 4096)
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096), 3)
system.time({
w1 <- outer(1.0 + test$vr/299792.458, as.numeric(colnames(camera1)[-1]))  # one row of shifted wavelengths per spectrum
for (i in 1:690075) {
my.data1 <- splinefun(x = w1[i, ], y = as.numeric(camera1[i, ][-1]))(y1)
final.data1[i, ] <- my.data1
}
})
final.data1 <- as.data.frame(final.data1)
colnames(final.data1) <- y1
Explanation:
I start by defining an object of class matrix to hold the results. I believe I got the dimensions of your final data.frame right. This reduces the running time because:
Matrices are much faster than data frames: they are just vectors with a dim attribute, so indexing them is fast. Data frames, by contrast, are lists that can hold all types of data (numeric, character, logical, other lists, etc.), and it is therefore slow to access their members.
Reserving the result's full memory in one operation saves R's memory management routines a lot of work; extending final.data1 on every iteration of the loop is very time consuming.
w1 is computed outside the loop for all rows at once (here as a matrix built with outer()), taking advantage of R's vectorized nature. Besides, you were repeating the computation of as.numeric(colnames(camera1[-1])) 690k times!
Test this code and if it doesn't produce the same final result, just say so and I will see if I can do something to debug it.
First remove all instructions that can be done once, and put them outside the for loop. For example: colnames and as.numeric.
Second, try to vectorize. The w1 values can be precomputed for all rows at once outside the for loop, for example as a 690075 x 4096 matrix built with outer(), instead of being rebuilt at every iteration.
Third, initialize final.data1 to its final dimensions. For each row added to this data.frame, R creates a new data.frame with one more row and then discards the previous one, which takes a long time. Thus, final.data1 <- matrix(NA, ncol = length(y1), nrow = nrow(camera1)).
And finally, if you want to use more than one core, try to replace the for loop with a parallelized foreach loop. This is possible if all rows are independent:
require(foreach)
require(doSNOW)
require(iterators) # for icount()
cl <- makeCluster(25, type="FORK") # FORK not usable in Windows
registerDoSNOW(cl) # register the cluster
clusterExport(cl, c("objects", "needed", "by", "each", "iteration"), envir=environment()) # for example y1, w1 and camera1
final.data1 <- foreach(i=icount(nrow(camera1)), .combine=rbind, .inorder=TRUE) %dopar%
{
# your R code
}
stopCluster(cl)
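For concreteness, the filled-in loop might look something like this, reusing the spline call from the question (an untested sketch; it assumes camera1, y1 and a precomputed 690075 x 4096 wavelength matrix w1, as discussed above, are available to the workers):
final.data1 <- foreach(i = icount(nrow(camera1)), .combine = rbind, .inorder = TRUE) %dopar% {
  # interpolate the i-th spectrum onto the new wavelength grid y1
  splinefun(x = w1[i, ], y = as.numeric(camera1[i, -1]))(y1)
}
colnames(final.data1) <- y1
Because every iteration returns a plain numeric vector, .combine = rbind assembles the final matrix directly.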
I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
matrix_of_sums[,i]<-
apply(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]],1,sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold simulation results.
So, the apply function looks up the data frame's column values for each of the 250,000 rows, computes the per-simulation sums over the matching columns, and stores them in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it cut the processing time in half, but I would like to try multi-core processing to see if that reduces it even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# Export the data the workers need (required for a PSOCK cluster)
clusterExport(cl, c("simulation_results", "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
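As an aside (an untested suggestion): when there are many pieces to combine, binding them in a single call is usually faster than folding them pairwise with Reduce:
ans <- do.call(cbind, matrix_of_sums)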
Without really having any applicable or sample data to go off of... the process would look like this:
Create a holding matrix(matrix_of_sums)
loop by row through variable table(dataframe_stuff_that_needs_lookup_from_simulation)
find matching indices within the simulation model(simulation_results)
bind the rowSums into the holding matrix(matrix of sums)
I recreated a sample set which is meaningless, but it follows the same process and should work for your data:
# Loop in parallel; each worker returns its rowSums vector.
# (Assigning into a holding matrix with `<<-` inside mclapply would not work
#  reliably, because forked children cannot modify objects in the parent process.)
msums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i,]
  rowSums(
    sim_df[,which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)
# Holding matrix which is our end-goal: bind the per-row results as columns
msums <- do.call(cbind, msums_list)
I am a noob R programmer. I have written code that needs to apply a function to a data frame split by factors. The data frame itself contains about 1.3 million (1,324,961) observations, with 64,376 levels of the factor variable that we use to slice the dataframe.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the Probabilites of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
#Dropping the Total Occurrences column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the director)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, slicing it by the levels of bod_fb$code_year.
Since the function is very large, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric value, fuzzy_blau_index, for each code_year of the dataframe, and I use by to apply the function to each slice. I wanted to know if there is a way to implement this in parallel so that multiple instances of the function run at once on different slices of the dataframe (sketched below). I did not find a by implementation in the parallel package, and I did not know how to pass the data frames as iterators when using the foreach and doParallel packages.
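For reference, the kind of split-and-parLapply pattern I was hoping for looks roughly like this (untested, and I am not sure it is memory-safe on a 4GB machine, since each slice is copied to a worker):
library(parallel)
slices <- split(bod_fb, bod_fb$code_year)      # one data frame per code_year
cl <- makeCluster(max(1, detectCores() - 1))
clusterExport(cl, "gen_fuzzy_blau")            # PSOCK workers need the function
res <- parLapply(cl, slices, gen_fuzzy_blau)
stopCluster(cl)
fuzzy_blau <- data.frame(fuzzy_blau_index = unlist(res), code_year = names(res))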
I have an AMD A8 laptop with 4GB RAM and Windows 7 SP1 Home Basic. I have allocated 20GB of page file memory (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still spent inside gen_sim_index; I am using proc.time() to gauge how long each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e. the sum over the unique elements of (number of occurrences of that element / total number of elements in the row)^2.
I am unable to use the apply function directly on the rows because each element in a row has different factor levels, so table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
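To illustrate, a minimal version of what the index should compute on a toy row (this is only a sketch of the specification, not necessarily the efficient implementation I am after):
gen_sim_index <- function(x) {
  # count the occurrences of each distinct value in the row
  freq <- table(as.character(unlist(x)))
  # sum of squared proportions: (occurrences of each unique element / row length)^2
  sum((freq / sum(freq))^2)
}
gen_sim_index(c("a", "a", "b", "c"))  # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375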
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of your gen_fuzzy_blau function; wrapping it all in one function rather than running it step by step provides no value. Then run it one line at a time. If a step takes too long, reconsider your method. Your code is incredibly inefficient.
I am trying to parallelize some function on the 4 cores of my machine using parLapply.
My function contains two nested loops which are meant to fill in some empty columns of a predefined matrix M.
However, when I run the code below I obtain the following error
2 nodes produced errors; first error: incorrect number of dimensions
Code:
require("parallel")
TheData<-list(E,T) # list of 2 matrices of different dimensions, T is longer and wider than E
myfunction <- function(TheData) {
for (k in 1:length(TheData[[1]][,1])) {
distance<-matrix(,nrow=length(TheData[[1]][,1]),ncol=1)
for (j in 1:length(TheData[[2]][,1])) {
distance[j]<-sqrt((as.numeric(TheData[[2]][j,1])-as.numeric(TheData[[1]][k,2]))^2+(as.numeric(TheData[[2]][j,2])-as.numeric(TheData[[1]][k,1]))^2)
}
index<-which(distance == min(distance))
M[k,4:9]<-c(as.numeric(TheData[[2]][index,1]),as.numeric(TheData[[2]][index,2]),as.numeric(TheData[[2]][index,3]),as.numeric(TheData[[2]][index,4]),as.numeric(TheData[[2]][index,5]),as.numeric(TheData[[2]][index,6]))
rm(distance)
gc()
}
}
n_cores <- 4
Cl = makeCluster(n_cores)
Results <- parLapplyLB(Cl, TheData, myfunction)
# I also tried: Results <- parLapply(Cl, TheData, myfunction)
In your example, parLapply is iterating over a list of matrices, and passing those matrices as the argument to "myfunction". However, "myfunction" seems to expect its argument to be a list of two matrices, and so an error occurs. I can reproduce that error with:
> E <- matrix(0, 4, 4)
> E[[1]][,1]
Error in E[[1]][, 1] : incorrect number of dimensions
I'm not sure what you're really trying to do, but with the current implementation of "myfunction", I would expect you to call parLapply with a list of lists containing two matrices, such as:
TheDataList <- list(list(A,B), list(C,D), list(E,F), list(G,H))
Passing this as the second argument to parLapply would result in "myfunction" being called four times, each time with a list containing two matrices.
But your example has another problem. It looks like you expect parLapply to modify the matrix "M" as a side-effect, but it can't. I think you should change "myfunction" to return a matrix. parLapply will return the matrices in a list which you can then bind together into the desired result.
Update
From your comment, I now believe that you essentially want to parallelize "myfunction". Here's my attempt to do that:
library(parallel)
cl <- makeCluster(4)
myfunction <- function(Exy) {
iM <- integer(nrow(Exy))
for (k in 1:nrow(Exy)) {
distance <- sqrt((Txy[,1] - Exy[k,2])^2 + (Txy[,2] - Exy[k,1])^2)
iM[k] <- which.min(distance)
}
iM
}
# Random example data for testing
T <- matrix(rnorm(150), 10)
E <- matrix(rnorm(120), 10)
# Only export the first two columns of T to the workers
Txy <- T[,1:2]
clusterExport(cl, c('Txy'))
# Parallelize "myfunction" by calling it in parallel on block rows of "E".
ExyList <- parallel:::splitRows(E[,1:2], length(cl))
iM <- do.call('c', clusterApply(cl, ExyList, myfunction))
# Update "M" using data from "T" indexed by "iM"
M <- matrix(0, nrow(E), 9) # more fake data; one output row per row of E
for (k in seq_along(iM)) {
  M[k, 4:9] <- T[iM[k], 1:6]
}
print(M)
stopCluster(cl)
Notes:
I vectorized myfunction which should make it more efficient. Hopefully it's nearly correct.
I also modified myfunction to return a vector of indices into "T" to reduce the amount of data sent back to the master.
The splitRows function from the parallel package is used to split the first two columns of "E" into a list of submatrices.
splitRows isn't exported by parallel, so I used ':::'. If this offends you, then use the splitRows function from snow which is exported.
The first two columns of "T" are exported to each of the workers since each task requires the entire first two columns.
clusterApply is used rather than parLapply since we need to iterate over submatrices of E.
I have been trying to use tapply, ave, and ddply to create statistics by group of a variable (age, sex). I haven't been able to use the above-mentioned R commands successfully.
library("ff")
df <- as.ffdf(data.frame(a=c(1,1,1:3,1:5), b=c(10:1), c=(1:10)))
tapply(df$a, df$b, length)
The error message I get is
Error in as.vmode(value, vmode) :
argument "value" is missing, with no default
or
Error in byMean(df$b, df$a) : object 'index' not found
There is currently no tapply or ave for ff vectors implemented in package ff.
But what you can do is use functionality in ffbase.
Let's elaborate on some bigger dataset
require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50 million records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)
For your simple aggregation, you can use binned_sum, from which you can extract the count (i.e. the length) easily as follows. Note that binned_sum needs an ff factor object as the bin, which can be obtained by calling as.character on the ff vector, as shown.
df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
For more complex aggregations you can use ffdfdply in ffbase. What I frequently do is combine it with some data.table statements like this:
require(data.table)
agg <- ffdfdply(df, split=df$groupbyfactor, FUN=function(x){
x <- as.data.table(x)
result <- x[, list(b.mean = mean(b), b.median = median(b), b.length = length(b), whatever = b[c == max(c)][1]), by = list(a)]
result <- as.data.frame(result)
result
})
class(agg)
aggg <- as.data.frame(agg) ## Puts the data in RAM!
This pulls your data into RAM in chunks, where each chunk contains a group of the split elements, and applies the function (e.g. the data.table statements, which need their data in RAM) to each chunk. The results from all chunks are then combined into a new ffdf, so that you can use it further or put it into RAM if your RAM allows that size.
The sizes of the chunks are controlled by getOption("ffbatchbytes"). So the more RAM you have the better, as it allows more data in each chunk in RAM.
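For example, if the machine has spare RAM you could raise it before calling ffdfdply (the value is in bytes; the figure below is only an illustration):
options(ffbatchbytes = 1024^3)  # allow roughly 1 GB of data per chunk
getOption("ffbatchbytes")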