reading and processing files in parallel in R

I am using the parallel library in R to process a large data set to which I am applying complex operations.
For the sake of reproducibility, here is a simpler example:
#data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for(i in 1:800)
{
  my.matrix <- matrix(runif(100), ncol = 10, nrow = 10)
  saveRDS(my.matrix, file = paste0(dir, "/matrix", i))
}
#worker function
worker.function <- function(files)
{
  files.length <- length(files)
  partial.results <- vector('list', files.length)
  for(i in 1:files.length)
  {
    matrix <- readRDS(files[i])
    partial.results[[i]] <- sum(diag(matrix))
  }
  Reduce('+', partial.results)
}
#master part
library(parallel)
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path = dir, recursive = FALSE, full.names = TRUE)
part <- clusterSplit(cl, seq_along(file_list))
files.partitioned <- lapply(part, function(p) file_list[p])
results <- clusterApply(cl, files.partitioned, worker.function)
result <- Reduce('+', results)
Essentially, I am wondering whether reading the files in parallel ends up being interleaved on the disk, and whether that I/O bottleneck would cut down on the expected performance gains of running the tasks in parallel.
Would it be better to first read all matrices at once into a list and then send chunks of this list to each core for processing? And if these matrices were much larger, would I even be able to load all of them into a list at once?
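To make that alternative concrete, here is a rough, untested sketch of what I mean (it assumes the same file_list and cluster cl as above):
all.matrices <- lapply(file_list, readRDS)       # sequential I/O on the master
chunks <- clusterSplit(cl, all.matrices)         # one chunk of matrices per core
chunk.worker <- function(mats) Reduce('+', lapply(mats, function(m) sum(diag(m))))
results <- clusterApply(cl, chunks, chunk.worker)
result <- Reduce('+', results)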

Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function looks like:
worker.function <- function(file) {
  matrix_list <- readRDS(file)
  partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
  Reduce('+', partial_results)
}
You should save some time on I/O, and maybe even some on computation, by replacing the for loop with lapply.
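For instance, a rough sketch of what that layout could look like for the toy data in the question (untested; the batch file names and the splitIndices() call are just illustrative, and it assumes the dir, cl and worker.function from above):
# Bundle the 800 random 10x10 matrices into one RDS file per worker.
library(parallel)
n.workers <- detectCores()
batches <- splitIndices(800, n.workers)
for (w in seq_along(batches)) {
  mats <- lapply(batches[[w]], function(i) matrix(runif(100), ncol = 10, nrow = 10))
  saveRDS(mats, paste0(dir, "batch", w, ".rds"))
}
# Each worker then reads exactly one file with the worker.function above:
batch.files <- list.files(dir, pattern = "^batch", full.names = TRUE)
results <- clusterApply(cl, batch.files, worker.function)
result <- Reduce('+', results)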

Related

Understanding writeValues of raster with parallel processing: is it possible to writeValues for each raster while using an mclapply fork cluster in R?

I am trying to understand how to parallelize raster processing in R. My goal is to parallelize the following on multiple cores with multiple rasters.
I process my raster blockwise and I try to parallelize it with mclapply or other functions. First I want to get the values of one raster or a raster stack, and then I want to write the values to the object. When I use multiple cores, it does not work, because different subprocesses want to write at the same time. Does somebody know a solution for that?
So here is the process:
get and create data
library(raster)
r <- raster(system.file("external/test.grd", package = "raster"))
s <- raster(r)
tr <- blockSize(r)
then getValues and writeValues with a for loop
s <- writeStart(s[[1]], filename = 'test.grd', overwrite = TRUE)
for (i in 1:tr$n) {
  v <- getValuesBlock(r, row = tr$row[i], nrows = tr$nrows[i])
  s <- writeValues(s, v, tr$row[i])
}
s <- writeStop(s)
this works fine
now trying the same with lapply
s <- writeStart(s[[1]], filename = 'test.grd', overwrite = TRUE)
#working with lapply
lapply(1:tr$n, function(x){
  v <- getValues(r, tr$row[x], tr$nrows[x])
  s <- writeValues(s, v, tr$row[x])
})
s <- writeStop(s)
works fine
Now trying with mclapply and one core
s <- writeStart(s[[1]], filename = 'test.grd', overwrite = TRUE)
#does work with mclapply and one core
parallel::mclapply(1:tr$n, function(x){
  v <- getValues(r, tr$row[x], tr$nrows[x])
  s <- writeValues(s, v, tr$row[x])
}, mc.cores = 1)
s <- writeStop(s)
also works
now trying with mclapply on multiple cores
s <- writeStart(s[[1]], filename = 'test.grd', overwrite = TRUE)
#does not work with multiple cores
parallel::mclapply(1:tr$n, function(x){
  v <- getValues(r, tr$row[x], tr$nrows[x])
  s <- writeValues(s, v, tr$row[x])
}, mc.cores = 2)
s <- writeStop(s)
So that does not work, and I understand why it does not work.
My question now is: suppose I have a raster stack with 2 rasters. Could I use mclapply or another function from the parallel package to organize this process differently, so that I get the values of the block for both grids at the same time, but each set of values is written to only one raster per core?
For the solution I am looking for, it is not acceptable to first get all values, save them in an object and then write the values blockwise, because my rasters are too large.
I would be very happy if someone has a solution or just an idea or suggestion.
Thanks.
I believe the object returned by raster::writeStart() can only be processed in the same R process as it was created. That is, it is not possible for a parallel R process to work with it.
The fact that the object uses an external pointer internally is a strong indicator that it cannot be exported to another R process, or saved to file and read back again. You can check for external pointers using the (non-public) future:::assert_no_references(), e.g.
> library(raster)
> r <- raster(system.file("external/test.grd", package="raster"))
> future:::assert_no_references(r)
NULL ## == no external pointer
> s <- raster(r)
> future:::assert_no_references(s)
NULL ## == no external pointer
> s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
> future:::assert_no_references(s)
Error: Detected a non-exportable reference ('externalptr') in one of the globals (<unknown>) used in the future expression
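In practice this means that, if you want one output raster per core, each worker has to run its own writeStart()/writeValues()/writeStop() cycle on a file it owns, inside its own process. A rough, untested sketch of that idea (the per-core output file names are made up for illustration):
library(raster)
library(parallel)

r  <- raster(system.file("external/test.grd", package = "raster"))
tr <- blockSize(r)

# Each forked worker creates, fills and closes its *own* output raster,
# so the writeStart() object never leaves the process that created it.
out_files <- mclapply(1:2, function(core) {
  f <- sprintf("test_core%d.grd", core)   # hypothetical per-core file name
  s <- writeStart(raster(r), filename = f, overwrite = TRUE)
  for (i in 1:tr$n) {
    v <- getValuesBlock(r, row = tr$row[i], nrows = tr$nrows[i])
    s <- writeValues(s, v, tr$row[i])
  }
  writeStop(s)
  f
}, mc.cores = 2)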

How to efficiently parse XML in R

I use R to parse XML data from a website. I have a list of 20,000 rows with URLs from which I need to extract data. I have code which gets the job done using a for loop, but it is very slow (it takes approx. 12 hours). I thought of using parallel processing (I have access to several CPUs) to speed it up, but I cannot make it work properly. Would it be more efficient to use a data table instead of a data frame? Is there any way to speed the process up? Thanks!
for (i in 1:nrow(list)) {
  t <- xmlToDataFrame(xmlParse(read_xml(list$path[i]))) #Read the data into a file
  t$ID <- list$ID[i]
  emptyDF <- bind_rows(all, t) #Bind all into one file
  if (i / 10 == floor(i / 10)) {
    print(i)
  } #print every 10th value to monitor progress of the loop
}
This script should point you in the correct direction:
t <- list()
for (i in 1:nrow(list)) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i])) #Read the data into a file
  tempdf$ID <- list$ID[i]
  t[[i]] <- tempdf
  if (i %% 10 == 0) {
    print(i)
  } #print every 10th value to monitor progress of the loop
}
answer <- bind_rows(t) #Bind all into one file
Instead of a for loop, an lapply would also work here. Without any sample data, this is untested.
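For reference, an untested lapply version of the same idea (it assumes the same list data frame with path and ID columns as in the question):
library(XML)
library(dplyr)

# Build one data frame per URL, then bind them together once at the end.
parse_one <- function(i) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i]))
  tempdf$ID <- list$ID[i]
  tempdf
}
answer <- bind_rows(lapply(seq_len(nrow(list)), parse_one))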

Does R bigmemory always use a backing file?

We are trying to use the bigmemory library with foreach to parallelize our analysis. However, the as.big.matrix function seems to always use a backing file. Our workstations have enough memory; is there a way to use bigmemory without the backing file?
The code x.big.desc <- describe(as.big.matrix(x)) is pretty slow, as it writes the data to C:\ProgramData\boost_interprocess\. Somehow it is slower than saving x directly; does as.big.matrix have slower I/O?
The code x.big.desc <- describe(as.big.matrix(x, backingfile = "")) is pretty fast; however, it will also save a copy of the data to the %TMP% directory. We think the reason it is fast is that R kicks off a background writing process instead of actually writing the data (we can see the writing thread in Task Manager after the R prompt returns).
Is there a way to use bigmemory with RAM only, so that each worker in the foreach loop can access the data via RAM?
Thanks for the help.
So, if you have enough RAM, just use standard R matrices. To pass only a part of the matrix to each cluster worker, use RDS files.
One example computing the colSums with 3 cores:
# Functions for splitting
CutBySize <- function(m, nb) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}
seq2 <- function(lims) seq(lims[1], lims[2])
# The matrix
bm <- matrix(1, 10e3, 1e3)
ncores <- 3
intervals <- CutBySize(ncol(bm), ncores)
# Save each part in a different file
tmpfile <- tempfile()
for (ic in seq_len(ncores)) {
  saveRDS(bm[, seq2(intervals[ic, ])],
          paste0(tmpfile, ic, ".rds"))
}
# Parallel computation with reading one part at the beginning
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
colsums <- foreach(ic = seq_len(ncores), .combine = 'c') %dopar% {
  bm.part <- readRDS(paste0(tmpfile, ic, ".rds"))
  colSums(bm.part)
}
parallel::stopCluster(cl)
# Checking results
all.equal(colsums, colSums(bm))
You could even use rm(bm); gc() after writing parts to the disk.

reading multiple files quickly in R

I have over 10000 csv files and I need to do a Fast Fourier Transform on each column of each csv file. I have access to 1000 cores. What would be the fastest way?
Currently I have a for loop reading each file sequentially and using the apply(data, 2, fft) function. How would I parallelize this? I tried clusterApply(cl, 1:10000, transformation), where the transformation function reads the csv. It still takes a long time to do all the reading. Does anyone know a faster way?
I would think that the fastest way would be to use mclapply and fread.
#Bring in libraries
library(parallel)
library(data.table)
#Find all csv files in your folder
csv.list = list.files(pattern="*.csv")
#Create function to read in data and perform fft on each column
read.fft <- function(x) {
  data <- fread(x)
  result <- data[, lapply(.SD, fft)]
  return(result)
}
#Apply function using multiple cores
all.results <- mclapply(csv.list,read.fft,mc.cores=10)
If it makes sense for you to take a random sample of each dataset, I would highly suggest changing the read.fft function to use the shuf command. It will speed up your read-in time by quite a bit.
#Create function to read in data and perform fft
read.fft <- function(x) {
  data <- fread(paste0("shuf -n 10000 ", x)) #Takes a random sample of 10000 rows
  result <- data[, lapply(.SD, fft)]
  return(result)
}
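One caveat: mclapply relies on forking, so on Windows mc.cores > 1 is not available and it falls back to a single core. An untested sketch of the same approach with a PSOCK cluster, which also works on Windows (the worker count of 10 just mirrors the mc.cores value above):
library(parallel)
library(data.table)

# Same idea as above, but with a socket cluster instead of forking.
csv.list <- list.files(pattern = "*.csv")
cl <- makeCluster(10)                  # number of workers is just an example
clusterEvalQ(cl, library(data.table))  # load data.table on every worker
all.results <- parLapply(cl, csv.list, function(x) {
  data <- fread(x)
  data[, lapply(.SD, fft)]
})
stopCluster(cl)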

Make a function and apply it to read data in R?

I have a set of data (around 50000 files, each of them about 1.5 MB). So, to load and process the data I first used this code;
data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the .txt files in the directory
then I use a for loop to load each file;
for (k in 1:length(listcsv)){
  data[[k]] <- read.csv(listcsv[k], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(as.double(data[[k]][1:57600, 2]))
  print(ort_my)
  a[k] <- ort_my
  write(a, file = "D:/ddd/ads.txt", sep = '\t', ncolumns = 1)
}
So, I set the program running, but even after 6 hours it had not finished, although I have a decent PC with 32 GB of RAM and a 6-core CPU.
I have searched the forum, and people say the fread function may be helpful. However, all the examples I have found so far deal with reading a single file with fread.
Can anyone suggest a solution for a faster loop to read and process data with this many rows and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient, but I think running in parallel could save you a bunch of time, and it also saves memory by not keeping every file around.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
  data <- fread(x, skip = 37)
  my <- as.matrix(data[1:57600, 2, with = FALSE])
  mesh <- array(my, dim = c(120, 60, 8))
  Ms <- 1350*10^3 # A/m
  asd2 <- (mesh[70:75, 24:36, 2])/Ms # in A/m
  ort_my <- mean(asd2)
  return(ort_my)
}
#R Code to run functions in parallel
library("foreach"); library("parallel"); library("doMC")
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(x = listcsv, .combine = rbind, .packages = c("data.table")) %dopar% readFiles(x)
registerDoSEQ() #Very important to close out the parallel backend.
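Note that doMC relies on forking and is not available on Windows; since the question writes to a D:/ path, a doParallel backend may be the safer choice here. An untested variant of the parallel part under that assumption, reusing the readFiles function above:
library(foreach); library(doParallel)

cl <- makeCluster(6)   # e.g. one worker per core; adjust to your machine
registerDoParallel(cl)
OutputList <- foreach(x = listcsv, .combine = rbind,
                      .packages = c("data.table")) %dopar% readFiles(x)
stopCluster(cl)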
