We are trying to use the BigMemory library with foreach to parallel our analysis. However, the as.big.matrix function seems always use backingfile. Our workstations have enough memory, is there a way to use bigMemory without the backing file?
This code x.big.desc <-describe(as.big.matrix(x)) is pretty slow as it write the data to C:\ProgramData\boost_interprocess\. Somehow it is slower than save x directly, is it as.big.matrix that have a slower I/O?
This code x.big.desc <-describe(as.big.matrix(x, backingfile = "")) is pretty fast, however, it will also save a copy of the data to %TMP% directory. We think the reason it is fast, because R kick off a background writing process, instead of actually writing the data. (We can see the writing thread in TaskManager after the R prompt returns).
Is there a way to use BigMemory with RAM only, so that each worker in foreach loop can access the data via RAM?
Thanks for the help.
So, if you have enough RAM, just use standard R matrices. To pass only a part of each matrix to each cluster, use rdsfiles.
One example computing the colSums with 3 cores:
# Functions for splitting
CutBySize <- function(m, nb) {
int <- m / nb
upper <- round(1:nb * int)
lower <- c(1, upper[-nb] + 1)
size <- c(upper[1], diff(upper))
cbind(lower, upper, size)
}
seq2 <- function(lims) seq(lims[1], lims[2])
# The matrix
bm <- matrix(1, 10e3, 1e3)
ncores <- 3
intervals <- CutBySize(ncol(bm), ncores)
# Save each part in a different file
tmpfile <- tempfile()
for (ic in seq_len(ncores)) {
saveRDS(bm[, seq2(intervals[ic, ])],
paste0(tmpfile, ic, ".rds"))
}
# Parallel computation with reading one part at the beginning
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
colsums <- foreach(ic = seq_len(ncores), .combine = 'c') %dopar% {
bm.part <- readRDS(paste0(tmpfile, ic, ".rds"))
colSums(bm.part)
}
parallel::stopCluster(cl)
# Checking results
all.equal(colsums, colSums(bm))
You could even use rm(bm); gc() after writing parts to the disk.
Related
I try to understand how to parallelize raster processing in R. My Goal ist to parallize the following on multiple cores with multiple rasters.
I process my raster blockwise and i try to parallelize it with mclapply or other functions. First i want to get the values of one raster or a rasterstack. and then i want to write the values to the object. When i am using multiple cores, it does not work, because different sub Processes want to write on the same time. Somebody know a solution for that?
So here is the process:
get and create data
r <- raster(system.file("external/test.grd", package="raster"))
s <- raster(r)
tr <- blockSize(r)
then getValues and writevalues with a for loop
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(r, row=tr$row[i], nrows=tr$nrows[i])
s <- writeValues(s, v, tr$row[i])
}
s <- writeStop(s)
this works fine
now trying the same on lapply
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#working with lapply
lapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
})
s <- writeStop(s)
works fine
Now trying with mclapply with one core
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#does work with mclapply one core
parallel::mclapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
}, mc.cores = 1)
s <- writeStop(s)
also works
now trying with mclapply on multiple cores
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#does not work with multiple core
parallel::mclapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
}, mc.cores = 2)
s <- writeStop(s)
So that does not work. I understand the logic, why it does not work.
My question now is: Suppose I have a rasterstack with 2 rasters. Could I use mclapply or another function from the parallel package to write this process differently. So I get the values of the block for both grids at the same time, but these values are only written to one rater per core.
For the solution I am looking for it is not acceptable to first get all values, safe them in an object and then write the values blockwise, because my rasters are to large.
I would be very happy if someone has a solution or just an idea or suggestion.
Thanks.
I believe the object returned by raster::writeStart() can only be processed in the same R process as it was created. That is, it is not possible for a parallel R process to work with it.
The fact that the object uses an external pointer internally is a strong indicator that it cannot be exported to another R process or saved to file or read back again. You can check for external pointers using (non-public) future:::assert_no_references(), e.g.
> library(raster)
> r <- raster(system.file("external/test.grd", package="raster"))
> future:::assert_no_references(r)
NULL ## == no external pointer
> s <- raster(r)
> future:::assert_no_references(s)
NULL ## == no external pointer
> s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
> future:::assert_no_references(s)
Error: Detected a non-exportable reference ('externalptr') in one of the globals (<unknown>) used in the future expression
I have a large corpus I'm doing transformations on with tm::tm_map(). Since I'm using hosted R Studio I have 15 cores and wanted to make use of parallel processing to speed things up.
Without sharing a very large corpus, I'm simply unable to reproduce with dummy data.
My code is below. Short descriptions of the problem is that looping over pieces manually in the console works but doing so within my functions does not.
Function "clean_corpus" takes a corpus as input, breaks it up into pieces and saves to a tempfile to help with ram issues. Then the function iterates over each piece using a %dopar% block. The function worked when testing on a small subset of the corpus e.g. 10k documents. But on larger corpus the function was returning NULL. To debug I set the function to return the individual pieces that had been looped over and not the re built corpus as a whole. I found that on smaller corpus samples the code would return a list of all mini corpus' as expected, but as I tested on larger samples of the corpus the function would return some NULLs.
Here's why this is baffling to me:
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
cleaned.corpus <- clean_corpus(corpus.regular[10001:20000], n = 1000) # also works
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # NULL
If I do this in 10k blocks up to e.g. 50k via 5 iterations everything works. If I run the function on e.g. full 50k documents it returns NULL.
So, maybe I just need to loop over smaller pieces by breaking my corpus up more. I tried this. In the clean_corpus function below parameter n is the length of each piece. The function still returns NULL.
So, if I iterate like this:
# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000)
If I do that 5 times manually up to 50K everything works. The equivalent of doing that in one call by my function is:
# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000)
Returns NULL.
This SO post and the one linked to in the only answer suggested it might be to do with my hosted instance of RStudio on linux where linux "out of memory killer oom" might be stopping workers. This is why I tried breaking my corpus into pieces, to get around memory issues.
Any theories or suggestions as to why iterating over 10k documents in 10 chunks of 1k works whereas 50 chunks of 1k do not?
Here's the clean_corpus function:
clean_corpus <- function(corpus, n = 500000) { # n is length of each peice in parallel processing
# split the corpus into pieces for looping to get around memory issues with transformation
nr <- length(corpus)
pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
rm(corpus) # save memory
# save pieces to rds files since not enough RAM
tmpfile <- tempfile()
for (i in seq_len(lenp)) {
saveRDS(pieces[[i]],
paste0(tmpfile, i, ".rds"))
}
rm(pieces) # save memory
# doparallel
registerDoParallel(cores = 14) # I've experimented with 2:14 cores
pieces <- foreach(i = seq_len(lenp)) %dopar% {
piece <- readRDS(paste0(tmpfile, i, ".rds"))
# transformations
piece <- tm_map(piece, content_transformer(replace_abbreviation))
piece <- tm_map(piece, content_transformer(removeNumbers))
piece <- tm_map(piece, content_transformer(function(x, ...)
qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T, char.keep = c("-", ":", "/"))))
}
# combine the pieces back into one corpus
corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
return(corpus)
} # end clean_corpus function
Code blocks from above again just for flow of readability after typing function:
# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # does not work
But iterating in console by calling the function on each of
corpus.regular[1:10000], corpus.regular[10001:20000], corpus.regular[20001:30000], corpus.regular[30001:40000], corpus.regular[40001:50000] # does work on each run
Note I tried using library tm functionality for parallel processing (see here) but I kept hitting "cannot allocate memory" errors which is why I tried to do it "on my own" using doparallel %dopar%.
Summary of solution from comments
Your memory issue is likely related to corpus <- do.call(function(...) c(..., recursive = TRUE), pieces) because this still stores all of your (output) data in memory
I recommended exporting your output from each worker to a file, such as a RDS or csv file, rather than collecting it into a single data structure at the end
An additional problem (as you pointed out) is that foreach will save the output of each worker with an implied return statement (the code block in {} after dopar is treated as a function). I recommended adding an explicit return(1) before the closing } to not save the intended output into memory (which you already explicitly saved as a file).
I am using the parallel library in R to process a large data set on which I am applying complex operations.
For the sake of providing a reproducible code, you can find below a simpler example:
#data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for(i in 1:800)
{
my.matrix <- matrix(runif(100),ncol=10,nrow=10)
saveRDS(my.matrix,file=paste0(dir,"/matrix",i))
}
#worker function
worker.function <- function(files)
{
files.length <- length(files)
partial.results <- vector('list',files.length)
for(i in 1:files.length)
{
matrix <- readRDS(files[i])
partial.results[[i]] <- sum(diag(matrix))
}
Reduce('+',partial.results)
}
#master part
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)
part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])
results <- clusterApply(cl,files.partitioned,worker.function)
result <- Reduce('+',results)
Essentially, I am wondering if trying to read files in parallel would be done in an interleaved fashion instead. And if, as a result, this bottleneck would cut down on the expected performance of running tasks in parallel?
Would it be better if I first read all matrices at once in a list then sent chunks of this list to each core for it to be processed? what if these matrices were much larger, would I be able to load all of them in a list at once ?
Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function looks like:
worker.function <- function(file) {
matrix_list <- readRDS(file)
partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
Reduce('+',partial.results)
}
You should save some time on I/O and maybe even on computation by replacing a for with a lapply.
I have set of data (around 50000 data. and each one of them 1.5 mb). So, to load the data and process the data first I have used this code;
data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the csv files in the directory
then I use for loop to load each data;
for (k in 1:length(listcsv)){
data[[k]]<- read.csv(listcsv[k],sep = "",as.is = TRUE, comment.char = "", skip=37);
my<- as.matrix(as.double(data[[k]][1:57600,2]));
print(ort_my);
a[k]<-ort_my;
write(a,file="D:/ddd/ads.txt",sep='\t',ncolumns=1)}
So, I set the program run but even if after 6 hours it didn't finished. Although I have a decent pc with a 32 GB ram and 6 core CPU.
I have searched the forum and maybe fread function would be helpful people say. However all the examples which I found so far deal with the single file reading with the fread function.
Can any one suggest me the solution of this problem for faster loop to read data and process it with these many rows and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient. But I think running in parallel could save you a bunch of time. And save you memory by not storing each file.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
data <- fread(x,skip=37)
my <- as.matrix(data[1:57600,2,with=F]);
mesh <- array(my, dim = c(120,60,8));
Ms<-1350*10^3 # A/m
asd2=(mesh[70:75,24:36 ,2])/Ms; # in A/m
ort_my<- mean(asd2);
return(ort_my)
}
#R Code to run functions in parallel
library(“foreach”);library(“parallel”);library(“doMC”)
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(listcsv,.combine=rbind,.packages=c(”data.table”)) %dopar% (readFiles(x))
registerDoSEQ() #Very important to close out parallel backend.
Is it possible to iterative over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win - just basic lapply - but this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
N = 100
library(rbenchmark)
mc.cores <- detectCores()
benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
mc.cores=mc.cores),
foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
replications=100)
Here is a second code sample
parseData <- function(x) {
x <- tryCatch(fromJSON(x),
error=function(e) return(list())
)
## need to do a test to see if valid data, if so ,save out the files
if (!is.null(x$id_str)) {
x$created_at <- strptime(x$created_at,"%a %b %e %H:%M:%S %z %Y")
fname <- paste("rdata/",
format(x$created_at, "%m"),
format(x$created_at, "%d"),
format(x$created_at, "%Y"),
"_",
x$id_str,
sep="")
saveRDS(x, fname)
rm(x, fname)
gc(verbose=FALSE)
}
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
Reading in parallel
You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:
result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)
Registering a backend can probably be done using the parallel package which is now integrated in base R. Or you can use the doSNOW package, see this post on my blog for details.
Processing in parallel
In this scenario your best bet is to read the entire dataset into a vector of characters, split the data and then use a parallel backend combined with e.g. the plyr functions.
probably not with readLines() due to the nature of non-parallel file-system IO. Of course, if you're using a parallel NFS or something like HDFS, then this restriction won't apply. But assuming you're on a "standard" architecture, it won't be feasible to parallelize your readLine() calls.
Your best bet would probably be to read in the entire file seeing as <500MB will probably fit in memory, then parallelize the processing once you're object is already read in.