Improving the computational time of R for loops - r

Trying to run a spline function by rows (690075) in a dataframe (Camera1) with 4096 columns (each column represents a position on the x axis) where the input variable to the function is a column in another dataset of the same length (test$vr) using a for loop; but I am having serious computational time issues.
I have tried converting the dataframe to a matrix and storing the output in a list amongst others, but to no avail. I have to do this for 2 other dataframes (camera2,camera3) of the same size.
Code
# Note camera1 and test$vr are of the same length
# Initialize
final.data1 <- data.frame()
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096),3)
system.time({
for (i in 1:690075) {
w1 = (as.numeric(colnames(camera1[-1]))) * (1.0 + test$vr[i]/299792.458)
my.data1<-as.data.frame(t(splinefun(x = w1, y = camera1[i,][-1])(y1)))
colnames(my.data1)=y1
final.data1 <- bind_rows(final.data1, my.data1)
} })
Running on a Ubuntu box with 344GB ram and 30 core Intel(R) Xeon(R) CPU E5-2695 # 2.30GHz
Any suggestions would be greatly appreciated.
Thank you.

Without seeing the data it's not easy to optimize your code, but I would start with something along the lines of the following.
final.data1 <- matrix(nrow = 690075, ncol = 4096)
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096), 3)
system.time({
w1 <- (as.numeric(colnames(camera1[-1]))) * (1.0 + test$vr/299792.458)
for (i in 1:690075) {
my.data1 <- t(splinefun(x = w1[i], y = camera1[i, ][-1])(y1))
final.data1[i, ] <- my.data1
}
})
final.data1 <- as.data.frame(final.data1)
colnames(final.data1) <- y1
Explanation:
I start by defining an object of class matrix to hold the results. I believe I got the dimensions of your final data.frame right. This reduces the running time because
Matrices are much faster than data frames, they are just folded vectors and indexing is fast. Data frames, on the contrary, are lists that can hold all types of data, numeric, character, logical, other lists, etc., and therefore it's slow to access their members.
By reserving the result's full memory in one operation saves R's memory management routines a lot of work. To extend final.data1 every iteration through the loop, is very time consuming.
w1 is computed outside the loop, taking advantage of R's vectorized nature. Besides, you were repeating the computation of as.numeric(colnames(camera1[-1])) 690k times!
Test this code and if it doesn't produce the same final result, just say so and I will see if I can do something to debug it.

First remove all instructions that can be done once, and put them outside the for loop. For example: colnames and as.numeric.
Second, try to vectorize. It seems that the w1 calculation can be vectorized, and so estimated once outside the for loop, by just removing the [i].
Third, initialize the final.data1 to the final dimension. For each row added to this data.frame, R will create a new data.frame with one more row, then remove the previous data.frame. It will take long time. Thus, final.data1 <- matrix(NA, ncol=length(y1), nrow=NROW).
And finally, if you want to use more than one core, try to replace the for loop by the parallelized foreach loop. It is possible if all rows are independant:
require(foreach)
require(doSNOW)
cl <- makeCluster(25, type="FORK") # FORK not usable in Windows
registerDoSNOW(cl) # register the cluster
clusterExport(cl, c("objects", "needed", "by", "each", "iteration"), envir=environment()) # for example y1, w1 and camera1
final.data1<- foreach(i=icount(NROW), .combine=rbind, inorder=FALSE) %dopar%
{
# your R code
}
stopCluster(cl)

Related

Way to parallelize QCA minimize() in R?

I'm using the R package QCA (https://cran.r-project.org/web/packages/QCA/index.html) for Qualitative Comparative Analysis. I want to be able to try out many different combinations, which is taking a very long time. On my faster CPU, trying all the options that I am interested in takes a little over 24 hours. R seems to be using only one of the cores available on my CPU and requires relatively little memory (just under 100MB). I am hoping someone has a good idea on how to speed up this process, perhaps through parallelization?
Here's what I'm doing:
Loading my data set (data), which is a CSV file with the outcome condition and all the options for my causal conditions. The causal conditions are in 4 groups A, B, C, and D. There are approx. 200 observations, i.e., rows in the data set.
Starting a log file with sink()
Creating a series of nested loops to generate each combination of causal conditions I want to examine.
Running the minimize() function within the nested loops. Specifically this looks like this:
for (a in causal_condition_group_A) {
for (b in causal_condition_group_B) {
for (c in causal_condition_group_C) {
for (d in causal_condition_group_D) {
minimize(data, outcome = my_outcome, conditions = paste0(a, ", ", b, ", ", c, ", ", d), ...)
}
}
}
}
The minimize function's conditions argument essentially takes a character vector as input and this is all my nested loops are creating. For example, a random conditions argument might read:
conditions = "causal_condition_A_87, causal_condition_B_2, causal_condition_C_42, causal_condition_D_219"
I tried several different things in an attempt to parallelize this approach, but so far I have not been successful. I tried experimenting with both parSapply and foreach %dopar%, but I am running into various problems. I either can't get the actual parallelization process to work properly or - and this is in some of my toy experiments - I am having trouble logging all the output, which is essential.
Please let me know if I can provide additional information to help clear things up! Thanks for your help!
EDIT:
I was able to create a working foreach() loop based on #HenrikB's advice, but I'm running into a different problem now.
Here's my test solution so far. It includes one less nested loop than I want in the final solution, but that's not important for now:
# SET QCA OUTCOME CONDITION
outcome = "c_outcome"
# LOAD LIBRARIES
library(doParallel)
library(QCA)
# CREATE CLUSTER FOR PARALLELIZATION
cores <- detectCores()
cl <- makeCluster(cores[1]-1, type = "PSOCK", outfile="")
registerDoParallel(cl)
# LOAD AND SET UP DATA
outcomecond <- read.csv("outcomes.csv", header=TRUE, row.names="ID")
causalcond <- read.csv("causal_conditions.csv", header=TRUE, row.names="ID")
data <- cbind(outcomecond[outcome], causalcond)
temp <- data[!is.na(data[outcome]), ] #keep only rows where outcome is not NA
# EXPORT CURRENT DATASET TO CLUSTERS
clusterExport(cl, "temp")
# CREATE CAUSAL CONDITION LISTS
causal_condition_group_A <- colnames(causalcond[, 1:99])
causal_condition_group_B <- colnames(causalcond[, 100:141])
causal_condition_group_C <- colnames(causalcond[, 142:183])
# EXPORT LIBRARIES TO CLUSTERS
clusterEvalQ(cl, library(doParallel))
clusterEvalQ(cl, library(QCA))
# START TIMER
start.time <- Sys.time()
# THREE NESTED FOREACH LOOPS (ONE FOR EACH CAUSAL CONDITION GROUP)
x <-
foreach(c=causal_condition_group_C, .combine='cbind') %:%
foreach(b=causal_condition_group_B, .combine='cbind') %:%
foreach(a=causal_condition_group_A, .combine='cbind') %dopar% {
tryCatch({
minimize(temp, outcome = outcome, conditions = paste0(a,",",b,",",c), n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE, use.letters = TRUE)
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
# END TIMER
end.time <- Sys.time()
# PROGRAM RUNNING TIME:
print(end.time - start.time)
# END CLUSTER
stopCluster(cl)
When I run this as a sequential %do% loop, I need something like 3 GB of memory to store x. However, when I try to run this sequentially, the memory rises exponentially. Here's a screenshot from shortly before I gave up:
Screenshot of task manager
Does someone know why %dopar% is using so much more memory and is there a way to avoid this?
Could I, for example, be writing x to file once in a while and purge memory after I do this? x is a list that is of dimension 14 by number of iterations in the foreach() loops. Here is what one of the "columns" in x looks like:
result.98
tt List,11
options List,10
negatives Numeric,3
initials "~A*B"
PIchart TRUE
primes Integer,2
solution List,1
essential "~A*B"
inputcases "205,245,253,306,425,468,490,511,514,585,587,657,684,696,739,740,784,796"
pims List,1
IC List,4
numbers Numeric,4
SA List,1
call Expression

Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
matrix_of_sums[,i]<-
apply(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]],1,sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold simulation results.
So, the apply function is looking up the dataframe columns values for each row in a 250,000 data set, computing the sum, and storing it in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowsums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above to "lapply" from "apply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
without really having any applicable or sample data to go off of... the process would look like this:
Create a holding matrix(matrix_of_sums)
loop by row through variable table(dataframe_stuff_that_needs_lookup_from_simulation)
find matching indices within the simulation model(simulation_results)
bind the rowSums into the holding matrix(matrix of sums)
I recreated a sample set which is meaningless and produces identical results but should work for your data
# Holding matrix which will be our end-goal
msums <- matrix(nrow = 2000,ncol = 0)
# Loop
parallel::mclapply(1:nrow(ts_df), function(i){
# Store the row to its own variable for ease
d <- ts_df[i,]
# cbind the results using the global assignment operator `<<-`
msums <<- cbind(
msums,
rowSums(
sim_df[,which(colnames(sim_df) %in% colnames(d))]
))
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)

Speed up for loop assigning data to matrix in R

I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loops speed but it still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything to help my current problem. Eventually this will be used as a part of a shiny app so all of variables I assign will need to be easily assigned different values.
Any advice to speed up the loop or more efficiently write this loop would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5*x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))
## start a loop
for (j in 1:num.sims) {
vec[1:init] <- 1; vec[(init+1):x] <- 2
year <- 2
freq.vec[j] <- sum(vec==1)/x
for (i in 1:(x*(time-1))) {
freq.1 <- sum(vec==1)/x; freq.2 <- 1 - freq.1
fit.ratio <- exp(freq*(freq.1-0.5) + log(ratio))
Pr.1 <- fit.ratio*freq.1/(fit.ratio*freq.1 + freq.2)
vec[ceiling(x*runif(1))] <- sample(c(1,2), 1, prob=c(Pr.1,1-Pr.1))
## record data
if (i %% x == 0) {
mat[j,year] <- sum(vec==1)/x
year <- year + 1
}}}
The inner loop is what is slowing you down. You're doing x number of iterations to update each cell in the matrix. Since each trip to modify vec depends on the previous iteration, this would be difficult to simplify. #Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale this up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could put a progress bar in Shiny to let the user know things are still working.
Depending on how often you will need to call this loop, it could be worth rewriting it in C++. R is built on C++, and any C++ will run many, many times faster than even efficient R code.
sourceCpp is a good package to start with: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp

clusterMap split list of data.frames

I'm working with two lists of data.frames and currently run something similar to this (simplified version of what I'm doing):
df1 <- data.frame("a","a1","L","R","b","c",1,2,3,4)
df2 <- data.frame("a","a1","L","R","b","c",4,4,4,4,4,44)
df3 <- data.frame(7,7,7,7)
df4 <- data.frame(5,5,5,5,9,9)
L1 <- list(df1,df2)
L2 <- list(df3,df4)
myfun <- function(x,y) {
difa = rowSums(abs(x[c(T,F)] - x[c(F,T)]))
difb=sum(abs(as.numeric(y[-c(1:6)])[c(T,F)] - as.numeric(y[-c(1:6)])[c(F,T)]))
diff <- difa + difb
return(diff)
}
output1 <- mapply(myfun, x = L2, y = L1)
The same number of data frames are in each list and each dataframe from one list corresponds to the dataframe in the other list. The dataframes in one list contain a single row while the other dataframes in the second list contain a dynamic number of rows; hence the use of sum and rowSums. The number of numeric columns are also dynamic but always the same between corresponding dataframes.
I'm looking to use parallel processing to speed up the computation when dealing with 1-10 million dataframes per list. I tried the following:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
clusterMap(cl, myfun, x = L2, y = L1)
stopCluster(cl)
However, due to the significant amount of data I'm using, this will quickly fill up the memory. I assume due to loading the entire lists of data frames in each cluster?? I'm new to parallel processing in R and have read that splitting up the data into chunks according to the number of cores available is required for some parallel functions that don't implement it automatically, so I tried the following which does not work:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
output1 <- clusterMap(cl, myfun, x = split(L2, ceiling(seq_along(L2)/no_cores)), y = split(L1, ceiling(seq_along(L1)/no_cores)))
stopCluster(cl)
Can someone help a newbie out? Most of the information I've been reading about uses parApply/parLapply/etc. I was able to use mcmapply, but since it only uses forking, I cannot use it. My code has to run on both unix and windows systems; hence my testing for OS.type to set it at fork.
UPDATE: So I think it is working correctly in the sense that it's parsing out chunks to different clusters, but the data type is not playing nice with binary operators inside the clusters. Issue appears to be related to it becoming a list of lists of dataframes and being treated as non-numeric in the clusters.

Using by in parallel in R

I am a noob R programmer. I have written a code that needs to apply a function to a data frame split by factors. The data frame in itself contains about 1 million 324961 observations with 64376 factors in the variable that we use to slice the dataframe.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the Probabilites of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
#Dropping the Total Occurances column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the direcor)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the apply the gen_fuzzy_blau function on this dataframe by slicing the dataframe with factors of bod_fb$code_year.
Since the function is very huge sequential processing is taking more than a day and ends up in running out of memory. The function gen_fuzzy_blau returns a numeric variable fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function on each slice. I wanted to know if there is a way to parallelly implement this code so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by implementation for parallel package and I did not know how to pass the dataframes as iterators while using foreach and doParallel packages.
I have a AMD A8 laptop with 4GB RAM and windows 7 sp1 home basic. I have given 20GB as page file memory (this was after I got the memory error).
Thank you
EDIT 1: #milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is being wasted in gen_sim_index in the function, I am using the proc.time()function to gauge the time that each part of the code is taking.
The function is supposed to the following to a row:
if we have a row (not a vector) say: a a b c the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2 ie, summation of (no of occurences of each unique element of each row/total no of elements in the row)^2
I am unable to use the apply function directly on the row because each element in a row because each element in the row has different factors and table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
You're saving your data 6 times over in 6 different variables. Try not doing that.
and it takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of your gen_fuzzy_blau function as it provides no value to wrap it up into one function rather than running it all independently. Then run it all one line at a time. If it takes too long to run, reconsider your method. Your code is incredibly inefficient.

Resources