Problem with accessing elements from parLapply() output

I have a problem with accessing elements from the output of parLapply(). When I use the non-parallel lapply() function, I can access the elements with the following code.
out_list <- lapply(my_list, my_fun) # placeholder names for my actual list and function
out_list[[2]][1:5, 1:5] # out_list[[2]] is a matrix in my specific case
But when I try to do the same with the output of parLapply(), I get an error.
The code:
out_list <- parLapply(cl = cluster, my_list, my_fun)
out_list[[2]][1:5, 1:5]
The error message:
in extract_matrix(x, i, j, ...) :
out_list instance has been unmapped.
Here is the full code:
#!/usr/bin/Rscript
path_to_files = '***********'
file.names <- list.files(path = path_to_files, pattern = "*.bed", full.names = TRUE, recursive = FALSE) # making a list of the desired files
# sequentially -------------------------------------------------------------------------------------------------------------------------------------
library(BGData)
print("Executing lapply...")
example_BEDMatrix_list <- lapply(file.names, BEDMatrix)
print("lapply() done.")
example_BEDMatrix_list[[4]][1:5, 1:5]
#------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# parallel ------------------------------------------------------------------------------------------------------------------------------------------------------
library(BGData)
library(parallel)
print("Creating cluster...")
copies_of_r <- detectCores() - 5
cluster <- makeCluster(copies_of_r)
clusterExport(cl=cluster, c('file.names'))
print("Cluster created")
print("Executing parLapply()...")
BEDMatrix_list <- parLapply(cluster, file.names[2:4], BEDMatrix)
BEDMatrix_list[[2]][1:5, 1:5]
print("parLapply() executed")
print("stopping cluster...")
stopCluster(cluster)
print("Cluster stopped")
How can I fix this?
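One possible workaround, assuming the error comes from BEDMatrix's memory mapping (an external pointer) not surviving serialization back from the workers, is to create the mappings in the master session; BEDMatrix() only maps the file rather than reading its contents, so this is cheap even without parallelism. An untested sketch:

# build the memory maps in the master session so they stay valid
library(BGData)
BEDMatrix_list <- lapply(file.names[2:4], BEDMatrix)
BEDMatrix_list[[2]][1:5, 1:5]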

Related

Read many files in parallel and extract data

I have 1000 JSON files that I would like to read in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector, I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
for(i in 1:1000){
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({if(nrow(data1) > 0){
      data1$cik <- strtrim(cik_files[i], 13)
      data[[length(data) + 1]] <- data1
    }}, silent = TRUE)
  }
}
This, however, takes quite a lot of time, so I was wondering how I can run the code within the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if(length(found) > 0){
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)
library(jsonlite)

# Not on Windows
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)

# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
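If the pieces should end up in a single data frame (an assumption about the desired result, mirroring the loop in the question), base R can stack the successful reads:

# stack the per-file data frames into one data frame
combined <- do.call(rbind, json_list[ok])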

R programming (beginner): Combining two lists --> dataframe --> csv

I tried to combine two lists into one dataframe:
all_stas <- list()
for(i in vid_id){
  stas <- get_stats(video_id = i)
  all_stas <- rbind(all_stas, stas)
}
View(all_stas)

all_detail <- list()
for(i in vid_id){
  detail1 <- get_video_details(video_id = i)
  all_detail <- rbind(all_detail, detail1)
}
View(all_detail)

df <- data.frame(all_stas, all_detail)
write.csv(df, file = "new_file.csv")
Afterwards I would like to store it in a CSV file.
When I run it, I get the following warning message:
Warning message:
In rbind(all_stas, stas) :
number of columns of result is not a multiple of vector length (arg 2)
Does anyone know how I can make the code work?
The block below is what triggers the warning:
all_stas <- list()
for(i in vid_id){
  stas <- get_stats(video_id = i)
  all_stas <- rbind(all_stas, stas)
}
If I understand your code correctly, you can get around it like this:
all_stas <- list()
for(i in vid_id){
  all_stas[[i]] <- get_stats(video_id = i)
}
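From there, assuming each call returns one row of data per video (an assumption; untested against the actual API), the original goal of one data frame written to CSV could look like:

# bind the per-video results row-wise, then combine column-wise
all_stas_df <- do.call(rbind, all_stas)
all_detail_df <- do.call(rbind, all_detail)
df <- data.frame(all_stas_df, all_detail_df)
write.csv(df, file = "new_file.csv", row.names = FALSE)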

Using foreach instead of for loop

I am trying to learn foreach to parallelise my task.
My for-loop looks like this:
# create an empty matrix to store results
mat <- matrix(-9999, nrow = length(unique(dat$mun)), ncol = 2)
for(mun in unique(dat$mun)) {
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1] == mun, 2] <- tot.dat
}
unique(dat$mun) has a length of 5563.
I want to use foreach to parallelise my task.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
foreach(icount(iters)) %dopar% {
  mun <- unique(dat$mun)[mun] # this is where I cannot figure out how to assign mun so that it reads the data for mun
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1] == mun, 2] <- tot.dat
}
This could be one solution.
Do note that I'm using Windows here, and I specified registerDoParallel() for it to work.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
registerDoParallel()
mun <- unique(dat$mun)
tableList <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  tot.dat <- sum(dat$x) # the value of the last expression is what gets stored in the list
}
unlist(tableList)
Essentially, whatever the expression inside {...} returns is stored in a list.
In this case, the result (tot.dat, a single number) is collected in tableList, and unlist() converts it to a vector for further use.
The result inside {...} can be anything: a single number, a vector, a data frame, and so on.
Another approach to your problem would be to combine all the data, labelling each row with its source file, so the middle component would look something like this:
library(plyr)
tableAll <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  dat$source <- mun[i]
  dat # return the labelled data frame, not just the label
}
rbind.fill(tableAll)
Then we can use it for further analysis.
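As a hypothetical follow-up (not part of the original answer), the combined table could reproduce the per-municipality totals the question was after:

# per-source totals of x, mirroring tot.dat from the original loop
combined <- rbind.fill(tableAll)
aggregate(x ~ source, data = combined, FUN = sum)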

Can't find variable when running in parallel

When I try this snippet of R code, I run into a problem with the parallel part.
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
  ## determine a "good" k using elbow
  dist.obj <- dist(mydata);
  hclust.obj <- hclust(dist.obj);
  css.obj <- css.hclust(dist.obj, hclust.obj);
  elbow.obj <- elbow.batch(css.obj);
  # print(elbow.obj)
  k <- elbow.obj$k
  return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl <- makeCluster(no_cores);
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
I get these error messages:
Error in get(name, envir = envir) : object 'data.original' not found
Error in checkForRemoteErrors(val) :
one node produced an error: could not find function "elbow.k"
Can anyone help me fix it? Thanks a lot.
I think your problem relates to variable scope. On Mac/Linux you have the option of using makeCluster(no_cores, type = "FORK"), which automatically carries all environment variables over to the workers. On Windows you have to use a Parallel Socket Cluster (PSOCK), which starts out with only the base packages loaded. You therefore have to specify exactly which variables and which libraries the parallel function needs: clusterExport() makes the required variables visible to the workers, and clusterEvalQ() loads the required packages on them. Note that any changes to a variable made after clusterExport() are ignored by the workers. Coming back to your problem: you must add the following:
clusterEvalQ(cl, library(GMD));
and your full code:
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
  ## determine a "good" k using elbow
  dist.obj <- dist(mydata);
  hclust.obj <- hclust(dist.obj);
  css.obj <- css.hclust(dist.obj, hclust.obj);
  elbow.obj <- elbow.batch(css.obj);
  # print(elbow.obj)
  k <- elbow.obj$k
  return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl <- makeCluster(no_cores);
clusterEvalQ(cl, library(GMD));
# every object named below must already exist in the master session;
# 'data.original' not being defined is what triggered the first error above
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
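For completeness, on Mac/Linux the FORK option mentioned above makes the exports unnecessary; a minimal sketch:

# FORK workers inherit the master's environment, so no clusterExport()
# or clusterEvalQ() calls are needed (Mac/Linux only)
cl <- makeCluster(no_cores, type = "FORK");
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
stopCluster(cl);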

How to save the output of a foreach loop in R

I have trouble saving my data output after a foreach loop.
Here is the function that reads and processes my data:
readFiles <- function(x){
  data <- read.table("filelist",
                     skip = grep('# Begin: Data Text', readLines(filelist)),
                     na.strings = c("NA", "-", "?"),
                     colClasses = "numeric")
  my <- as.matrix(data[1:57600,2]);
  mesh <- array(my, dim = c(120,60,8));
  Ms <- 1350*10^3 # A/m
  asd2 <- (mesh[70:75,24:36 ,2])/Ms; # in A/m
  ort_my <- mean(asd2);
  return(ort_my)
}
Here is the code for the parallel processing:
# R code to run functions in parallel
detectCores() # this will tell you how many cores are available
library("foreach");
library("parallel");
library(doParallel)
#library("doMC") # this is for Linux
#registerDoMC(12) # register the parallel backend
cl <- makeCluster(4)
registerDoParallel(cl) # register 4 CPUs for the parallel backend
OutputList <- foreach(i = 1:length(filelist),
                      .combine = 'c', .packages = c("data.table")) %dopar% (readFiles)
#registerDoSEQ() # very important to close out the parallel backend
aa <- OutputList
stopCluster(cl)
print(Sys.time() - strt)
write.table(aa, file = "D:/ads.txt", sep = '\t')
Everything runs smoothly, but when I check OutputList, all I see is the function definition.
I want the ort_my value for each file in filelist.
Here is what I see:
[[70]]
function (x)
{
    data <- read.table("filelist", skip = grep("# Begin: Data Text",
        readLines(filelist)), na.strings = c("NA", "-", "?"),
        colClasses = "numeric")
    my <- as.matrix(data[1:57600, 2])
    mesh <- array(my, dim = c(120, 60, 8))
    Ms <- 1350 * 10^3
    asd2 = (mesh[70:75, 24:36, 2])/Ms
    ort_my <- mean(asd2)
    return(ort_my)
}
<environment: 0x00000000151aef20>
How can I do that?
Best regards.
I then used the doSNOW package to do the same thing:
library(foreach)
library(doSNOW)
getDoParWorkers()
getDoParName()
registerDoSNOW(makeCluster(8, type = "SOCK"))
getDoParWorkers()
getDoParName()
strt <- Sys.time()
data1 <- list() # creates a list
filelist <- dir(pattern = "*.omf") # creates the list of all the .omf files in the directory
i = 1:length(filelist)
readFiles <- function(m){
  for (k in 1:length(filelist))
    data[[k]] <- read.csv(filelist[k], sep = "", as.is = TRUE, comment.char = "", skip = 37); # skip = 37 skips the 37-line header of the .omf files
  my <- as.matrix(data[[k]][1:57600,2]);
  mesh <- array(my, dim = c(120,60,8));
  Ms <- 1350*10^3 # A/m
  asd2 <- (mesh[70:75,24:36 ,2])/Ms; # in A/m
  ort_my <- mean(asd2);
  return(ort_my)
}
out <- foreach(m = 1:i, .combine = rbind, .verbose = T) %dopar% readFiles(m)
print(Sys.time() - strt)
I get the following error messages:
Error in readFiles(m) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition: Warning message:
In 1:i : numerical expression has 70 elements: only the first used
As ?"%dopar%" states, in obj %dopar% ex, ex is an R expression to evaluate. If your free variable in foreach is i, you should use readFiles(i). Currently, you're in fact returning a function object.
BTW, you have some mess in the code. For example, I think that readFiles is independent of x (even if it has x as a formal argument)... Shouldn't it be readLines(filelist[[x]])?
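Putting those two points together, a corrected version of the doSNOW attempt might look like this (an untested sketch that keeps the question's slicing and constants):

readFiles <- function(x){
  # read one file per task instead of looping over the whole filelist
  data1 <- read.csv(filelist[[x]], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(data1[1:57600, 2])
  mesh <- array(my, dim = c(120, 60, 8))
  Ms <- 1350*10^3 # A/m
  asd2 <- (mesh[70:75, 24:36, 2])/Ms # in A/m
  mean(asd2) # returned value, collected by foreach
}
out <- foreach(x = seq_along(filelist), .combine = rbind, .verbose = TRUE,
               .export = "filelist") %dopar% readFiles(x)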
