Running 'xlsx' processes in parallel, using the 'parallel' R package

I have a project where I need to process some data from an Excel file with R. I must use the 'xlsx' package because of some specific functions.
First, I wrote a script, which works as expected and without errors:
options(java.parameters = "-Xmx4096m") # request extra JVM memory, before loading xlsx
library(xlsx)

wb <- loadWorkbook(file = "my_excel.xlsx")
sheet1 <- getSheets(wb)[[1]]
rows <- getRows(sheet1)

make_df <- function(x) {
  cells <- getCells(rows[x])
  styles <- sapply(cells, getCellStyle)
  cellColor <- function(style) {
    fg <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
  }
  colors <- sapply(styles, cellColor)
  if (!any(colors == "ff0000")) {
    df[nrow(df) + 1, ] <- sapply(cells, getCellValue) # I define this 'df' somewhere in the code; this part could be improved
  }
}

df <- sapply(1:length(rows), make_df)
In short, I am looking for the rows in the Excel sheet that contain no red-colored cells, as described here. The problem is that the Excel file is very big, and processing it takes a long time.
What I'd like to do is to run the row checking in parallel, to be more efficient, so I added:
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl = cl, library(xlsx)) # load the package on each worker
clusterExport(cl = cl, c('rows'))    # share the 'rows' variable with the workers
df <- parSapply(cl, 1:length(rows), make_df)
And after running this, I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: RcallMethod: attempt to call a method of a NULL object.
I tried the parallelization with another example, without using 'xlsx' functions, and it worked.
After some digging, I found this post which offered somewhat of an answer (more like a workaround), but I can't seem to implement it.
Is there a clean way to do what I'm trying to do here?
If not, then what would be the best solution in this case?
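For context: rJava objects such as wb and rows are external pointers into a running JVM, so they cannot be serialized and shipped to worker processes; clusterExport hands each worker a dead reference, which is what produces "attempt to call a method of a NULL object". The usual workaround is to have every worker open the workbook itself. A minimal, hedged sketch of that idea (get_row is a hypothetical renaming of make_df that returns values instead of appending to a global df, since cross-process writes to a shared data frame cannot work anyway; the file name is taken from the question):

library(parallel)

cl <- makeCluster(detectCores() - 1)

# Build the JVM-side objects on each worker instead of exporting them:
invisible(clusterEvalQ(cl, {
  options(java.parameters = "-Xmx4096m")
  library(xlsx)
  wb   <- loadWorkbook(file = "my_excel.xlsx")
  rows <- getRows(getSheets(wb)[[1]])
  NULL # avoid shipping the rJava handles back to the master
}))

# Return the row's values (or NULL for rows containing red cells):
get_row <- function(x) {
  cells  <- getCells(rows[x])
  styles <- sapply(cells, getCellStyle)
  colors <- sapply(styles, function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    paste(rgb, collapse = "")
  })
  if (!any(colors == "ff0000")) sapply(cells, getCellValue) else NULL
}
clusterExport(cl, "get_row")

# 'rows' on the master (from the original script) is used only for the count:
res <- parLapply(cl, seq_along(rows), get_row)
df  <- as.data.frame(do.call(rbind, Filter(Negate(is.null), res)))
stopCluster(cl)

Note that each worker re-parses the whole file, so for a very large workbook the startup cost per worker can be substantial; whether this beats the sequential version depends on the file.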

Related

Data frame creation inside Parlapply in R

I am trying something pretty simple: I want to run a bunch of regressions in parallel. When I generate the data as in PART 1 below, the parallel part does not work and gives the error listed further down.
# PART 1
library(MASS) # for mvrnorm

p <- 20; rho <- 0.7
cdc <- diag(p)
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    cdc[i, j] <- cdc[j, i] <- rho^abs(i - j)
  }
}
my.data <- mvrnorm(n = 100, mu = rep(0, p), Sigma = cdc)
The parallel part does work, however, if I generate the data as in PART 2:
# PART 2
my.data <- matrix(rnorm(1000, 0, 1), nrow = 100, ncol = 10)
I configured the function that I want to run in parallel as:
parallel_fun <- function(obj, my.data) {
  p1 <- nrow(cov(my.data))
  store.beta <- matrix(0, p1, length(obj))
  count <- 1
  for (itration in obj) {
    my_df <- data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my.model <- bas.lm(y ~ ., data = my_df, alpha = 3,
                       prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf <- coef(my.model, estimator = "MPM")
    betas <- cf$postmean[-1]
    store.beta[-itration, count] <- betas
    count <- count + 1
  }
  result <- list('Beta' = store.beta)
}
And I run parLapply the following way:
{
  library(parallel)
  library(doParallel)

  no_cores <- detectCores(logical = TRUE)
  myclusternumber <- (no_cores - 1)
  cl <- makeCluster(myclusternumber)
  registerDoParallel(cl)
  p1 <- ncol(my.data)
  obj <- splitIndices(p1, myclusternumber)
  clusterExport(cl, list('parallel_fun', 'my.data', 'obj'), envir = environment())
  clusterEvalQ(cl, {
    library(MASS)
    library(Matrix)
    library(BAS)
  })
  newresult <- parallel::parLapply(cl, obj, fun = parallel_fun, my.data)
  stopCluster(cl)
}
But whenever I use the data from PART 1, I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: object 'my_df' not found
But this should not happen: the data frame is created inside the function, so I have no idea why this is happening. Any help is appreciated.
Posting this as one possible workaround; see if it works:
parallel_fun <- function(obj, my.data) {
  p1 <- nrow(cov(my.data))
  store.beta <- matrix(0, p1, length(obj))
  count <- 1
  for (itration in obj) {
    my_df <- data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my_df <<- my_df
    my.model <- bas.lm(y ~ ., data = my_df, alpha = 3,
                       prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf <- BAS:::coef.bas(my.model, estimator = "MPM")
    betas <- cf$postmean[-1]
    store.beta[-itration, count] <- betas
    count <- count + 1
  }
  result <- list('Beta' = store.beta)
}
The issue seems to be with the BAS:::coef.bas function, which calls eval() in order to get my_df and fails to find it when called in parallel. The "hack" here is to force my_df out to the parent environment by calling my_df <<- my_df.
There should be a better way to do this, but <<- might be the fastest one. In general, <<- can cause unwanted behaviour, especially when used in loops. Assigning a unique variable name before exporting (and not forgetting to remove it after use) is one way to tackle that; a sketch follows below.
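For illustration, a hedged sketch of that unique-name idea (fit_one and tmp_name are hypothetical names, and the interaction with BAS internals is assumed rather than tested). The data gets a per-worker name, the model call is built so that coef()'s internal eval() can resolve that name, and the variable is removed once the coefficients have been extracted:

library(BAS)

fit_one <- function(my_df) {
  tmp_name <- paste0("my_df_", Sys.getpid()) # unique per worker process
  assign(tmp_name, my_df, envir = globalenv())
  on.exit(rm(list = tmp_name, envir = globalenv()), add = TRUE)
  # Build the call so the recorded data argument is the unique name,
  # which coef()'s internal eval() can then find:
  my.model <- eval(bquote(
    bas.lm(y ~ ., data = .(as.name(tmp_name)), alpha = 3,
           prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
  ))
  coef(my.model, estimator = "MPM")$postmean[-1]
}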

Errors when parallelizing for loop

I'm new to parallel processing and am attempting to parallelize a for loop in which I create new columns in a data frame by matching a column of that data frame against two other data frames. j, the data frame I'm creating columns in, is 400000 x 54. a and c, the two data frames I'm matching j against, are 5000 x 12 and 45000 x 8, respectively.
Below is my initial loop prior to the attempt at parallelizing:
for (i in 1:nrow(j)) {
  if (j$Inspection_Completed[i] == TRUE) {
    next
  }
  j$Assigned_ID <- a$Driver[match(j$car_name, a$CarName)]
  j$Title <- c$Title[match(j$Site_ID, c$LocationID)]
  j$Status <- c$Status[match(j$Site_ID, c$LocationID)]
}
So far I have attempted the following:
cl <- snow::makeCluster(4)
doSNOW::registerDoSNOW(cl)
foreach::foreach(i = 1:nrow(j)) foreach::`%dopar%` {
  if (j$Inspection_Completed[i] == TRUE) {
    next
  }
  j$Assigned_ID <- a$Driver[match(j$car_name, a$CarName)]
  j$Title <- c$Title[match(j$Site_ID, c$LocationID)]
  j$Status <- c$Status[match(j$Site_ID, c$LocationID)]
}
stopCluster(cl)
However, when I run the code above I receive several errors.
Error: unexpected symbol in "foreach::foreach(i = 1:nrow(j)) foreach"
And:
Error: object 'i' not found
Lastly:
Error: unexpected '}' in "}"
I'm not sure why I'm getting these errors. None of the columns in any of the data frames are factors, and I haven't been able to spot any mismatched parentheses or brackets. I've also tried this without the snow and doSNOW packages, and the result is the same. I've run it without the backticks around %dopar% as well, with the same result.
(I didn't know this before.)
R doesn't like infix operators with the ::-notation. Even if you're doing that for namespace management, R isn't having it:
1L %in% 1:2
# [1] TRUE
1L base::%in% 1:2
# Error: unexpected symbol in "1L base"
1L base::`%in%` 1:2
# Error: unexpected symbol in "1L base"
Workarounds:
Redefine your own infix that just mimics the other, as in
`%myin%` <- base::`%in%`
1L %myin% 1:2
# [1] TRUE
Use explicit namespace loading, i.e. call library(foreach) before that point in your code, and just write %dopar%. (Using library(foreach) does not stop you from also writing foreach::foreach; the prefix simply becomes unnecessary.) A sketch of the repaired call follows below.
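To make workaround 2 concrete, here is a minimal, hedged sketch of the fixed syntax with a toy body. It is deliberately not a drop-in rewrite of the question's loop, because next is not valid inside %dopar%, and column assignments made on workers never reach the master's copy of j; a foreach loop returns one value per iteration instead:

library(foreach)
library(doSNOW)

cl <- snow::makeCluster(4)
doSNOW::registerDoSNOW(cl)

# foreach is an expression that returns one value per iteration,
# combined here with c():
res <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}

snow::stopCluster(cl)

Incidentally, the three match() assignments in the question are already vectorized over every row of j, so they only need to run once, outside any loop; parallelizing them row by row buys nothing.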

Problems with changing a function's variable inside another, lower-level function in R?

I need to open files (matrices) from a directory and apply a function, pca, to each one. It uses another function, count_pc, which is supposed to zero out the matrix's diagonals step by step and append each recalculated PC1 to the table pcs from the enclosing function. At first I didn't think about environments, so count_pc crashed with an "unknown variable" error. Then I tried to do it this way:
library(plyr) # for ldply

files <- list.files()

count_pc <- function(x, env = parent.frame()) {
  diag(file[x:nrow(file), ]) <- 0
  diag(file[, x:nrow(file)]) <- 0
  pcn <- prcomp(file, scale = FALSE)
  pcn <- data.frame(pcn$rotation)
  pcs <- cbind(pcs, pcn$PC1)
}

pca <- function(filename) {
  file <- as.matrix(read.table(filename))
  pc <- prcomp(file, scale = FALSE)
  pc <- data.frame(pc$rotation)
  pc1 <- pc$PC1
  pcs <- data.frame(pc1)
  for (k in 1:40) {
    count_pc(k)
  }
  new_filename <- strsplit(filename, "_")[[1]][3]
  print(pcs)
  colnames(pcs) <- paste0(0:40, rep("_bins_deleted", 40))
  write.table(pcs, file = paste(new_filename, "eigenvectors", sep = "_"))
  return(apply(pcs, 2, cor, y = pc1))
}

ldply(files, pca)
And indeed, count_pc no longer crashes with the above error, but unfortunately it crashes with a new one:
"colnames<-`(`*tmp*`, value = c("0_bins_deleted", "1_bins_deleted", :
'names' [41] attribute must be the same length as the vector [1]"
which means that count_pc does not change the variables it needs to. At first I thought the problem might be connected with using sapply(1:40, count_pc), so I replaced it with a loop, but that didn't help. I've also tried environment(count_pc) <- environment() inside pca, but that didn't help either (nor did changing the variable names in count_pc to env$'name'). I don't know what to do, and googling doesn't seem to help.
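The underlying problem is that plain assignment inside an R function never writes into the caller's variables, so pcs inside count_pc is a new local object that is discarded on return. The conventional fix is to have count_pc return the recalculated PC1 and let pca collect the results. A hedged sketch of that restructuring (it reuses the zeroing logic from the question; whether the zeroing should accumulate across iterations is unclear from the question, so this version restarts from the original matrix each time):

library(plyr) # for ldply, as in the question

# Receives the matrix explicitly and returns the recalculated PC1:
count_pc <- function(file, x) {
  diag(file[x:nrow(file), ]) <- 0
  diag(file[, x:nrow(file)]) <- 0
  pcn <- prcomp(file, scale = FALSE)
  data.frame(pcn$rotation)$PC1
}

pca <- function(filename) {
  file <- as.matrix(read.table(filename))
  pc1  <- data.frame(prcomp(file, scale = FALSE)$rotation)$PC1
  # Column 1 is the untouched PC1; columns 2..41 come from count_pc():
  pcs  <- data.frame(pc1, sapply(1:40, function(k) count_pc(file, k)))
  colnames(pcs) <- paste0(0:40, "_bins_deleted") # 41 names for 41 columns
  new_filename <- strsplit(filename, "_")[[1]][3]
  write.table(pcs, file = paste(new_filename, "eigenvectors", sep = "_"))
  apply(pcs, 2, cor, y = pc1)
}

ldply(list.files(), pca)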

dim(X) must have a positive length

I am trying to call an Azure Machine Learning web service from Microsoft Power BI (a visualization tool) through R. The process demands that the input be given as a list, so I am converting my input to a list in R. Below is my code:
library(RODBC) # assuming 'conn' is an existing RODBC connection for sqlQuery()

dataset <- data.frame(sqlQuery(conn, "SELECT * FROM dbo.Automobile"))
close(conn)

if (nrow(dataset) > 0)
{
  dataset <- dataset[, c(-1, -14)]
  dataset <- na.omit(dataset)

  createList <- function(dataset)
  {
    temp <- apply(dataset, 1, function(x) as.vector(paste(x, sep = "")))
    colnames(temp) <- NULL
    temp <- apply(temp, 2, function(x) as.list(x))
    return(temp)
  }
  ...
I am very new to R, so the code above is taken straight from Power BI's documentation. But it gives the following error:
dim(X) must have a positive length
I tried googling this error and applied some of the suggested workarounds, like:
1. using the lapply function
2. adding drop = FALSE
but they kept returning errors.
Can anyone help me with this?
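For what it's worth, this message comes from apply() itself: it is raised whenever the X argument has no dim attribute, i.e. it is a plain vector rather than a matrix or data frame. In the code above, the most likely trigger is that the first apply() collapses to a vector (which happens when its per-row results have length 1, e.g. when only one column survives the filtering), so the second apply() fails. A small illustrative sketch with toy data, not the asker's:

# apply() needs an object with dimensions, so a plain vector triggers the error:
v <- c("a", "b", "c")
# apply(v, 2, as.list)   # Error: dim(X) must have a positive length

# lapply() works element-wise with no dims required, which is likely why it
# was suggested as a workaround:
lapply(v, as.list)

# And drop = FALSE keeps dims alive through single-row/column subsetting:
m <- matrix(1:6, nrow = 3)
m1 <- m[, 1, drop = FALSE]  # still a 3 x 1 matrix, not a vector
apply(m1, 2, sum)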

R parallel foreach loops with spatial data

I have a large SpatialLinesDataFrame in R called lines, to which I want to apply the line2route function from the stplanr package.
To speed up the process, I wanted to break the file into chunks and run them in parallel.
library(doParallel)
library(stplanr) # for line2route

batch_size <- ceiling(nrow(lines) / 6)
cl <- makeCluster(6)
registerDoParallel(cl)

foreach(i = 1:6) %dopar% {
  l_start <- as.integer(1 + (i - 1) * batch_size)
  if (i * batch_size < nrow(lines)) {
    l_fin <- as.integer(i * batch_size)
  } else {
    l_fin <- as.integer(nrow(lines))
  }
  lines_sub <- lines[c(l_start:l_fin), ]
  rq <- line2route(l = lines_sub, route_fun = route_cyclestreet, plan = "quietest")
  saveRDS(rq, file = paste0("../temp/rq_batch_", i, ".Rds"))
}
The code breaks lines up into 6 parts, runs the function on each, and saves the results.
This works fine in a for loop, but when I change it to a foreach loop and try to run it in parallel, I get the error message:
Error in { : task 1 failed - "c("assignment of an object of class
\"tbl_df\" is not valid for @'data' in an object of class
\"SpatialLinesDataFrame\"; is(value, \"data.frame\") is not TRUE",
"assignment of an object of class \"tbl\" is not valid for @'data' in
an object of class \"SpatialLinesDataFrame\"; is(value,
\"data.frame\") is not TRUE", "assignment of an object of class
\"data.frame\" is not valid for @'data' in an object of class
\"SpatialLinesDataFrame\"; is(value, \"data.frame\") is not TRUE")"
Is it possible to run a foreach loop with spatial data? I'm not worried about rejoining the data at the end, as I can do that separately later.
I ran into this issue when I had a spatial data frame on which I had used a dplyr function, e.g.:
sp1@data <- bind_cols(sp1@data, new_col)
I believe that dplyr converts the data slot of the spatial data frame (sp1@data) to a tbl_df. To solve the issue, all I had to do was:
sp1@data <- as.data.frame(bind_cols(sp1@data, new_col))
In a nutshell, you just have to ensure that all your @data slots are proper data frames, not tbl_df objects.
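Applied to the loop in the question, that amounts to one extra line before the line2route() call; a hedged sketch, assuming (as above) that a dplyr verb touched lines at some earlier point:

lines_sub <- lines[c(l_start:l_fin), ]
# Coerce the @data slot back to a plain data.frame in case dplyr left a tbl_df:
lines_sub@data <- as.data.frame(lines_sub@data)
rq <- line2route(l = lines_sub, route_fun = route_cyclestreet, plan = "quietest")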
