I'm trying to run multiple iterations of a function using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. For each iteration I use sample to randomly select a subset of the dataframe, and this happens inside the function I pass to mclapply. However, the results of each iteration in the output list are identical, suggesting that mclapply is not re-running the sample lines each time. This must be something to do with how I have written the code; any ideas where I have gone wrong?
Here is a reproducible example of a small dataset that runs quickly. You will notice that the 10 iterations in the d.val.all output list are identical, which is not what I am after.
library(bipartite)
library(doBy)
library(parallel)
# create dummy data
ecto.matrix1 <- data.frame(replicate(10, sample(0:80, 81, rep = TRUE)),
                           Species.mix.90 = sample(c("R", "M", "S", "B"), 81, rep = TRUE))
# set up the function
funct.resample.d <- function(i) {
  RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "R"]
  MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "M"]
  randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "S"], 8, replace = FALSE)
  randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "B"], 8, replace = FALSE)
  resampledSites <- c(RedSites, MountainSites, randomSilverSites, randomBlackSites) # make vector of the site names
  matrix <- ecto.matrix1[resampledSites, ] # select only those rows from the resampled row names
  # drop numeric cols that sum to 0; the trailing TRUE keeps the species column
  # (without it the logical index is one short and gets recycled)
  matrix1 <- matrix[, c(colSums(matrix[, -ncol(matrix)]) > 0, TRUE)]
  matrix2 <- summaryBy(matrix1[, -ncol(matrix1)] ~ Species.mix.90, data = matrix1, FUN = sum)
  colnames(matrix2) <- sub(".sum", "", colnames(matrix2), fixed = TRUE) # strip the ".sum" suffix (sub is vectorized)
  row.names(matrix2) <- matrix2$Species.mix.90 # make row names
  matrix2 <- subset(matrix2, select = -c(Species.mix.90)) # drop host col
  d.val <- dfun(matrix2)$dprime
  return(d.val)
}
# run mclapply
reps <- 1:10
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)
In case anyone else is having similar issues, I figured out that the problem was with the summaryBy function rather than sample. I replaced summaryBy with aggregate, and the randomization worked fine.
matrix2 <- aggregate(. ~ Species.mix.90, matrix1, sum)
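A related point for anyone sampling inside mclapply: to make the parallel draws reproducible, switch to the L'Ecuyer-CMRG generator so each fork gets its own random stream. A minimal sketch:
# Give each forked worker its own reproducible random stream
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10, mc.set.seed = TRUE)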
I have the following R "apply" statement:
for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) {
  matrix_of_sums[, i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When the function is run, a 2000-row x 250,000-column (# of simulations x items being simulated) structure meant to hold the simulation results.
So, the apply call looks up the simulation_results columns named in each row of a 250,000-row data set, computes the per-simulation sums, and stores them in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# PSOCK workers start with empty environments, so export the data first
clusterExport(cl, c("simulation_results",
                    "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation),
                            function(i)
                              rowSums(simulation_results[, colnames(simulation_results) %in%
                                        dataframe_stuff_that_needs_lookup_from_simulation[i, ]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # loads parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
  rowSums(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else, however.
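One further note: Reduce("cbind", matrix_of_sums) copies the accumulating matrix on every step; with 250,000 result columns, binding everything in a single call should be noticeably faster:
# Bind all list elements in one call instead of pairwise
ans <- do.call(cbind, matrix_of_sums)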
Without really having any applicable or sample data to go off of, the process would look like this:
Loop by row through the variable table (dataframe_stuff_that_needs_lookup_from_simulation)
Find the matching indices within the simulation model (simulation_results)
Collect the rowSums of each match
Bind the collected sums into the holding matrix (matrix_of_sums)
I recreated a sample set which is meaningless and produces identical results, but the pattern should work for your data:
# Each call returns one column of sums; mclapply collects them in a list.
# Note: assigning into a global holding matrix with `<<-` would not work
# here, because forked workers get copies of the parent environment and
# their writes never reach the master process.
msums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i, ]
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores())
# Bind the collected columns into the holding matrix, which is our end-goal
msums <- do.call(cbind, msums_list)
I'd like input on how my code below is structured. Would like to know if it needs to be organized in a different way to execute faster. Specifically, whether I need to be using foreach and dopar differently in the nested loops. Currently, the inner loop is the bulk of the work (ddply with between 1-8 breakdown variables, each of which has 10-200 levels), and that's what I have running in parallel. I left out the code details for simplicity.
Any ideas? My code, as organized below, does work, but it takes a few hours on a 6-core, 41 GB machine. The dataset isn't that large (< 20k records).
for (m in 1:length(Predictors)) { # has up to three elements in the vector
  # construct the dataframe based on the specified predictor
  # subset the original dataframe based on the breakdown variables, outcome, predictor and covariates
  for (l in 1:nrow(pairwisematrixReduced)) { # this has 1-6 rows; subset based on correct comparison groups
    # some code here
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)
    for (i in 1:nrow(subsetting_table)) { # this table has about 50 rows
      # this uses the columns specified by k in the glm; the prior columns will be used as breakdown variables
      # up to 10 covariates
      result[[length(result) + 1]] <- foreach(k = 11:17, .packages = c('plyr', 'reshape2', 'fastmatch')) %dopar% {
        ddply(df,
              b, # vector of breakdown variables
              function(x) {
                # run a GLM and manipulate the output
              },
              .parallel = TRUE)
      } # close k loop -- set of covariates
    } # close i loop -- subsetting table
  } # close l loop -- group combinations
} # close m loop -- the pairwise predictor matrix
stopCluster(cl)
result <- unlist(result, recursive = FALSE)
tmp2 <- do.call(rbind.fill, result)
Copied out of vignette("nested")
3 Using %:% with %dopar%
When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is...
You are also using foreach %dopar% along with ddply and .parallel = TRUE. With a six-core processor (and presumably hyperthreading), the foreach block would start 12 environments, and the ddply inside would then start 12 environments within each of those, for 144 simultaneous environments. The foreach should be changed to %do% to be consistent with your question's text of running the inner loop in parallel. Or, to make it cleaner, change both loops to foreach and use %dopar% for one and %:% for the other.
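For illustration, a minimal self-contained sketch of the %:% pattern (the data frame and loop ranges here are toy stand-ins, not the question's real GLM inputs):
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
# Toy data standing in for the real breakdown variables and covariates
df <- data.frame(g = rep(1:4, each = 25), y = rnorm(100), x = rnorm(100))
# %:% merges the two loops into a single stream of tasks, so only one
# layer of workers is ever started (no nested parallelism)
res <- foreach(i = 1:3, .combine = "rbind") %:%
  foreach(k = 1:2, .combine = "rbind") %dopar% {
    sub <- df[df$g == ((i + k) %% 4) + 1, ]
    fit <- glm(y ~ x, data = sub)
    data.frame(i = i, k = k, slope = coef(fit)[["x"]])
  }
stopCluster(cl)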
I am a noob R programmer. I have written a code that needs to apply a function to a data frame split by factors. The data frame contains about 1.3 million (1,324,961) observations, with 64,376 levels in the factor variable that we use to slice the dataframe.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
  # Convert the occurrence counts to numeric
  bod_sample <- as.data.frame(bod_sample)
  bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
  bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
  bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
  bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
  bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
  # Calculate the probabilities of a director belonging to a caste
  bod_sample$caste1_occ <- bod_sample$caste1_occ / bod_sample$tot_occur
  bod_sample$caste2_occ <- bod_sample$caste2_occ / bod_sample$tot_occur
  bod_sample$caste3_occ <- bod_sample$caste3_occ / bod_sample$tot_occur
  bod_sample$caste4_occ <- bod_sample$caste4_occ / bod_sample$tot_occur
  # Drop the total occurrences column, as we do not need it anymore
  bod_sample$tot_occur <- NULL
  # Replace all the blanks with NA
  bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
  bod_sample <- as.data.frame(bod_sample)
  # Push all the NAs in the caste names and caste probabilities to the end of the row,
  # so if there are only two castes against a name, they become caste1 and caste2
  caste_list <- data.frame(bod_sample$caste1, bod_sample$caste2,
                           bod_sample$caste3, bod_sample$caste4)
  caste_list <- as.data.frame(t(apply(caste_list, 1,
                                      function(x) c(x[!is.na(x)], x[is.na(x)]))))
  caste_list_prob <- data.frame(bod_sample$caste1_occ, bod_sample$caste2_occ,
                                bod_sample$caste3_occ, bod_sample$caste4_occ)
  caste_list_prob <- as.data.frame(t(apply(caste_list_prob, 1,
                                           function(x) c(x[!is.na(x)], x[is.na(x)]))))
  # Two helper functions:
  # gen_castelist: takes the row number (serial number of the director) and returns
  #                the names of all the castes for which he has a non-zero probability.
  # gen_caste_prob: takes the row number (serial number of the director) and returns
  #                 the probability with which he belongs to each caste.
  gen_castelist <- function(x){
    y <- caste_list[x, ]
    y <- as.vector(y[!is.na(y)])
    return(y)
  }
  gen_caste_prob <- function(x){
    z <- caste_list_prob[x, ]
    z <- z[!is.na(z)]
    z <- as.numeric(z)
    return(z)
  }
  caste_ls <- list()
  caste_prob_ls <- list()
  for(i in 1:nrow(bod_sample)){
    caste_ls[[i]] <- gen_castelist(i)
    caste_prob_ls[[i]] <- gen_caste_prob(i)
  }
  gridcaste <- expand.grid(caste_ls)
  gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors = FALSE)
  gridcasteprob <- expand.grid(caste_prob_ls)
  # Generate the joint probability
  gridcasteprob$JP <- apply(gridcasteprob, 1, prod)
  # Generate the similarity index
  gen_sim_index <- function(x){
    x <- t(x)
    a <- as.data.frame(table(x))
    sim_index <- sum(a$Freq^2) / (sum(a$Freq))^2
    return(sim_index)
  }
  gridcaste$sim_index <- apply(gridcaste, 1, gen_sim_index)
  # Generate the fuzzy-Blau index
  gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
  fuzzy_blau_index <- sum(gridcaste$fb)
  return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, sliced by the factor levels of bod_fb$code_year.
Since the function is very heavy, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric variable fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function on each slice. I wanted to know if there is a way to implement this code in parallel, so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by implementation in the parallel package, and I did not know how to pass the dataframes as iterators while using the foreach and doParallel packages.
I have an AMD A8 laptop with 4 GB RAM and Windows 7 SP1 Home Basic. I have given 20 GB as page file memory (this was after I got the memory error).
Thank you
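A sketch of one way to parallelize the by() step (untested on the real data; note that mclapply cannot fork on Windows, so a PSOCK cluster is used here):
library(parallel)
# Split the data frame into one slice per code_year, then process the
# slices in parallel; a PSOCK cluster works on Windows, unlike forking
cl <- makeCluster(detectCores())
slices <- split(bod_fb, bod_fb$code_year)
out <- parLapply(cl, slices, gen_fuzzy_blau)
stopCluster(cl)
fuzzy_blau <- data.frame(fuzzy_blau_index = unlist(out),
                         code_year = names(out))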
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is being wasted in gen_sim_index in the function. I am using the proc.time() function to gauge the time that each part of the code is taking.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e., the sum over the unique elements of the row of (number of occurrences of that element / total number of elements in the row)^2.
I am unable to use the apply function directly on the row because each element in the row has different factor levels, so table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
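For reference, a leaner gen_sim_index sketch (it assumes each row can be coerced to a plain character vector, which sidesteps the factor-level problem with table()):
# Sum of squared proportions of each unique value in a row
gen_sim_index <- function(x) {
  x <- as.character(x) # drop factor levels so table() counts the actual values
  f <- table(x)
  sum((f / length(x))^2)
}
gen_sim_index(c("a", "a", "b", "c")) # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375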
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of your gen_fuzzy_blau function: wrapping it all in one function provides no value over running it independently, and it hides where the time goes. Run it one line at a time; if a line takes too long to run, reconsider your method. Your code is incredibly inefficient.
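As a concrete example of trimming the gsub() cost (a sketch; it assumes the blanks look like "" or " " as in the question's regex, and that the caste columns are character rather than factor):
# Replace blanks with NA only in the character columns, leaving numerics untouched
chr <- vapply(bod_sample, is.character, logical(1))
bod_sample[chr] <- lapply(bod_sample[chr], function(x) replace(x, trimws(x) == "", NA))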
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10, 20, 30, 40, 50, 60, 70, 80)
b <- c("a", "b", "c", "d", "z", "g", "h", "r")
c <- c(1, 2, 3, 4, 5, 6, 7, 8)
myframe <- data.frame(a, b, c)
columns <- 6 # defined here so the next two lines can run
values <- vector(length = columns)
solution <- matrix(nrow = nrow(myframe), ncol = columns + 3)
myfunction <- function(frame, columns){
  athing <- 0
  if (columns == 5) {
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns + 1] <- athing
  return(values)
}
The problematic for-loop looks like this:
for (i in 1:nrow(myframe)) {
  values <- myfunction(as.matrix(myframe[i, ]), columns)
  values[columns + 2] <- i
  values[columns + 3] <- myframe[i, 3]
  # more columns added with simple operations (i.e. sum)
  solution <- rbind(solution, values)
  # solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages regarding the size of solution, which seems to be too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really come further than this, although I tried applying this very good introduction to parallel processing.
How do I have to reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight out of the 2nd circle of hell in The R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that in each iteration you enlarge your solution data frame and then copy it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list whose ith component will store the output of the ith loop iteration. The final step is to call rbind on that list (just once at the end). This will look something like:
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
# Call all necessary commands to create values
my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
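The same pattern with lapply, reusing the loop body from the question (myfunction and columns as defined there):
my.list <- lapply(seq_len(nrow(myframe)), function(i) {
  values <- myfunction(as.matrix(myframe[i, ]), columns)
  values[columns + 2] <- i
  values[columns + 3] <- myframe[i, 3]
  values
})
solution <- rbind(solution, do.call(rbind, my.list))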
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
  athing <- 0
  if (columns == 5) {
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns + 1] <- athing
  return(values)
}
apply(myframe, 1, myfunction)
If columns is not given via the environment, you can use:
apply(myframe, 1, myfunction, columns) with your original myfunction definition.
I am trying to parallelize some function on the 4 cores of my machine using parLapply.
My function contains two nested loops which are meant to fill out some empty columns of a predefined matrix M.
However, when I run the code below I obtain the following error:
2 nodes produced errors; first error: incorrect number of dimensions
Code:
require("parallel")
TheData<-list(E,T) # list of 2 matrices of different dimensions, T is longer and wider than E
myfunction <- function(TheData) {
for (k in 1:length(TheData[[1]][,1])) {
distance<-matrix(,nrow=length(TheData[[1]][,1]),ncol=1)
for (j in 1:length(TheData[[2]][,1])) {
distance[j]<-sqrt((as.numeric(TheData[[2]][j,1])-as.numeric(TheData[[1]][k,2]))^2+(as.numeric(TheData[[2]][j,2])-as.numeric(TheData[[1]][k,1]))^2)
}
index<-which(distance == min(distance))
M[k,4:9]<-c(as.numeric(TheData[[2]][index,1]),as.numeric(TheData[[2]][index,2]),as.numeric(TheData[[2]][index,3]),as.numeric(TheData[[2]][index,4]),as.numeric(TheData[[2]][index,5]),as.numeric(TheData[[2]][index,6]))
rm(distance)
gc()
}
}
n_cores <- 4
Cl = makeCluster(n_cores)
Results <- parLapplyLB(Cl, TheData, myfunction)
# I also tried: Results <- parLapply(Cl, TheData, myfunction)
In your example, parLapply is iterating over a list of matrices, and passing those matrices as the argument to "myfunction". However, "myfunction" seems to expect its argument to be a list of two matrices, and so an error occurs. I can reproduce that error with:
> E <- matrix(0, 4, 4)
> E[[1]][,1]
Error in E[[1]][, 1] : incorrect number of dimensions
I'm not sure what you're really trying to do, but with the current implementation of "myfunction", I would expect you to call parLapply with a list of lists containing two matrices, such as:
TheDataList <- list(list(A,B), list(C,D), list(E,F), list(G,H))
Passing this as the second argument to parLapply would result in "myfunction" being called four times, each time with a list containing two matrices.
But your example has another problem. It looks like you expect parLapply to modify the matrix "M" as a side-effect, but it can't. I think you should change "myfunction" to return a matrix. parLapply will return the matrices in a list which you can then bind together into the desired result.
Update
From your comment, I now believe that you essentially want to parallelize "myfunction". Here's my attempt to do that:
library(parallel)
cl <- makeCluster(4)
myfunction <- function(Exy) {
iM <- integer(nrow(Exy))
for (k in 1:nrow(Exy)) {
distance <- sqrt((Txy[,1] - Exy[k,2])^2 + (Txy[,2] - Exy[k,1])^2)
iM[k] <- which.min(distance)
}
iM
}
# Random example data for testing
T <- matrix(rnorm(150), 10)
E <- matrix(rnorm(120), 10)
# Only export the first two columns of T to the workers
Txy <- T[,1:2]
clusterExport(cl, c('Txy'))
# Parallelize "myfunction" by calling it in parallel on block rows of "E".
ExyList <- parallel:::splitRows(E[,1:2], length(cl))
iM <- do.call('c', clusterApply(cl, ExyList, myfunction))
# Update "M" using data from "T" indexed by "iM"
M <- matrix(0, nrow(T), 9) # more fake data
for (k in iM) {
M[k,4:9] <- T[k, 1:6]
}
print(M)
stopCluster(cl)
Notes:
I vectorized myfunction which should make it more efficient. Hopefully it's nearly correct.
I also modified myfunction to return a vector of indices into "T" to reduce the amount of data sent back to the master.
The splitRows function from the parallel package is used to split the first two columns of "E" into a list of submatrices.
splitRows isn't exported by parallel, so I used ':::'. If this offends you, then use the splitRows function from snow, which is exported.
The first two columns of "T" are exported to each of the workers since each task requires the entire first two columns.
clusterApply is used rather than parLapply since we need to iterate over submatrices of E.