I have an R function that loads, processes, and saves many files. Here is a dummy version:
load_process_saveFiles <- function(onlyFiles = c()){
  allFiles <- paste(LETTERS, '.csv', sep = '')

  # If desired, only include certain files
  if(length(onlyFiles) > 0){
    allFiles <- allFiles[allFiles %in% onlyFiles]
  }

  for(file in allFiles){
    # load file
    rawFile <- file

    # Run a super long function
    processedFile <- rawFile

    # Save file
    # write.csv(processedFile, paste('./Other/Path/', file, sep = ''), row.names = FALSE)

    cat('\nDone with file ', file, sep = '')
  }
}
It has to run through about 30 files, and each one takes about 3 minutes, so looping through the whole thing serially is very time consuming. What I'd like to do is run them all separately at the same time, so that everything finishes in about 3 minutes instead of 3 x 30 = 90 minutes.
I know I can achieve this by creating a bunch of RStudio sessions or many terminal tabs, but I can't handle having that many sessions or tabs open at once.
Ideally, I'd like to have all of the files with separate functions listed in one batchRun.R file which I can run from the terminal:
source('./PathToFunction/load_process_saveFiles.R')
load_process_saveFiles(onlyFiles = 'A.csv')
load_process_saveFiles(onlyFiles = 'B.csv')
load_process_saveFiles(onlyFiles = 'C.csv')
load_process_saveFiles(onlyFiles = 'D.csv')
load_process_saveFiles(onlyFiles = 'E.csv')
load_process_saveFiles(onlyFiles = 'F.csv')
I would then run $ Rscript batchRun.R from the terminal.
I've tried looking up different examples on SO that accomplish something similar, but each has some unique features and I just can't get it to work. Is what I'm trying to do possible? Thanks!
Package parallel gives you a number of options. One option is to parallelize the calls to load_process_saveFiles and have the loop inside of the function run serially. Another option is to parallelize the loop and have the calls run serially. The best way to assess which approach is more suitable for your job is to time them both yourself.
Evaluating the calls to load_process_saveFiles in parallel is relatively straightforward with mclapply, the parallel version of the base function lapply (see ?lapply):
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)
Here, x is a list of values of the argument onlyFiles, and mc.cores = 2L indicates that you want to divide the calls among two R processes.
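For example, a minimal sketch (assuming load_process_saveFiles() has been sourced as in the question, and picking files A through F):

source('./PathToFunction/load_process_saveFiles.R')
x <- as.list(paste0(LETTERS[1:6], '.csv'))      # list('A.csv', 'B.csv', ..., 'F.csv')
results <- parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)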
Evaluating the loop inside of load_process_saveFiles in parallel would involve replacing the entire for statement with something like
f <- function(file) {
  cat("Processing file", file, "...")
  x <- read(file)
  y <- process(x)
  write(y, file = file.path("path", "to", file))
  cat(" done!\n")
}
parallel::mclapply(allFiles, f, ...)
and redefining load_process_saveFiles to allow optional arguments:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
  ## body
}
Then you could do, for example, load_process_saveFiles(onlyFiles, mc.cores = 2L).
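As a sketch, the redefined function could fill in "## body" with the pieces above (the helper f defined earlier and the file list from the question); any extra arguments are simply forwarded to mclapply:

load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
  allFiles <- paste(LETTERS, '.csv', sep = '')
  if (length(onlyFiles) > 0L) {
    allFiles <- allFiles[allFiles %in% onlyFiles]
  }
  # extra arguments, such as mc.cores, are passed straight through to mclapply
  parallel::mclapply(allFiles, f, ...)
}

load_process_saveFiles(onlyFiles = c('A.csv', 'B.csv'), mc.cores = 2L)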
I should point out that mclapply is not supported on Windows. On Windows, you can use parLapply instead, but there are some extra steps involved. These are described in the parallel vignette, which can be opened from R with vignette("parallel", "parallel"). The vignette acts as a general introduction to parallelism in R, so it could be worth reading anyway.
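A rough sketch of the parLapply route (the cluster size and file list are placeholders; the function and any packages it needs have to be made available on the workers):

library(parallel)
source('./PathToFunction/load_process_saveFiles.R')

cl <- makeCluster(2L)   # PSOCK cluster, works on Windows
clusterEvalQ(cl, source('./PathToFunction/load_process_saveFiles.R'))  # define the function (and anything it needs) on each worker
x <- as.list(paste0(LETTERS[1:6], '.csv'))
parLapply(cl, x, load_process_saveFiles)
stopCluster(cl)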
The parallel package is useful in this case. If you are using a Linux OS, I would recommend the doMC package instead of parallel. doMC is useful even for looping over big data in machine learning projects.
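A minimal sketch of the doMC route (Linux/macOS only; the core count is arbitrary and the function is assumed to be sourced as in the question):

library(foreach)
library(doMC)
registerDoMC(cores = 4)   # register the multicore backend

source('./PathToFunction/load_process_saveFiles.R')
allFiles <- paste0(LETTERS[1:6], '.csv')
foreach(f = allFiles) %dopar% load_process_saveFiles(onlyFiles = f)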
Related
My goal is to do some operations on a data frame, as follows:

exp_info <- data.frame(location.Id = 1:1e7,
                       x = rnorm(10))

For each location, I want to compute the square of the x variable and write out an individual csv file. My actual computation is lengthier and involves other steps, so this is a simplified example. This is how I am parallelising my task:
library(doParallel)

myClusters <- parallel::makeCluster(6)
doParallel::registerDoParallel(myClusters)

foreach(i = 1:nrow(exp_info),
        .packages = c("dplyr", "data.table"),
        .errorhandling = 'remove',
        .verbose = TRUE) %dopar% {
  rowRef <- exp_info[i, ]
  rowRef <- rowRef %>% dplyr::mutate(x.sq = x^2)
  fwrite(rowRef, paste0(i, '_iteration.csv'))
}
When I look at my working directory, all of the individual csv files (1e7 of them) have been written out, which suggests the code above is working. However, the foreach loop does not end even after all the files are written, and I have to kill the job, which also does not produce any error. Does anyone have any idea why this could happen?
I'm experiencing something similar. I don't know the answer, but I will add this: the same code and operation works on one computer, but fails to exit the foreach loop on another computer. Hope this provides some direction.
I am trying to extract data from a .nc file. Since there are 7 variables in my file, I want to loop the ncvar_get function through all 7 using foreach.
Here is my code:
# EXTRACTING CLIMATE DATA FROM NETCDF4 FILE
library(dplyr)
library(data.table)
library(lubridate)
library(ncdf4)
library(parallel)
library(foreach)
library(doParallel)
# SET WORKING DIRECTORY
setwd('/storage/hpc/data/htnb4d/RIPS/UW_climate_data/')
# SETTING UP
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
# READING INPUT FILE
infile <- nc_open("force_SERC_8th.1979_2016.nc")
vars <- attributes(infile$var)$names
climvars <- vars[1:7]
# EXTRACTING INFORMATION OF STUDY DOMAIN:
tab <- read.csv('SDGridArea.csv', header = T)
point <- sort(unique(tab$PointID)) #6013 points in the study area
# EXTRACTING DATA (P, TMAX, TMIN, LW, SW AND RH):
clusterEvalQ(cl, {
  library(ncdf4)
})
clusterExport(cl, c('infile', 'climvars', 'point'))

foreach(i = climvars) %dopar% {
  climvar <- ncvar_get(infile, varid = i) # all 13650 data points
  dim <- dim(climvar)
  climMX <- aperm(climvar, c(3, 2, 1))
  dim(climMX) <- c(dim[3], dim[1]*dim[2])
  climdt <- data.frame(climMX[, point]) # keep the 6013 points in the study area
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = F)
}
stopCluster(cl)
And the error is:
Error in { : task 1 failed - "error returned from C call"
Calls: %dopar% -> <Anonymous>
Execution halted
Could you please explain what is wrong with this code? I assume it has something to do with the cluster not being able to work out which variable to get from the file, since 'error returned from C call' usually comes from the varid argument of ncvar_get.
I had the same problem (identical error message) running a similar R script on my MacBook Pro (OSX 10.12.5). The problem seems to be that the different workers from the foreach loop try to access the same .nc file at the same time with ncvar_get. This can be solved by using ncvar_get outside the foreach loop (storing all the data in a big array) and accessing that array from within the foreach loop.
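A minimal sketch of that first suggestion, reusing the object names from the question and assuming the cluster cl, the point vector, and the libraries are already set up as there (only the master process touches the .nc file; the workers just reshape and write):

infile <- nc_open("force_SERC_8th.1979_2016.nc")
climvars <- attributes(infile$var)$names[1:7]

# serial reads on the master: each variable becomes a plain array
climlist <- lapply(climvars, function(v) ncvar_get(infile, varid = v))
names(climlist) <- climvars
nc_close(infile)

clusterExport(cl, c('climlist', 'point'))
foreach(i = climvars) %dopar% {
  climvar <- climlist[[i]]             # plain array, safe to send to workers
  d <- dim(climvar)
  climMX <- aperm(climvar, c(3, 2, 1))
  dim(climMX) <- c(d[3], d[1] * d[2])
  climdt <- data.frame(climMX[, point])
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = FALSE)
}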
Obviously, another solution would be to split the .nc file up appropriately beforehand and then access the different .nc files from within the foreach loop. This should lower memory consumption, since copying the big array to each worker is avoided.
I had the same issue on a recently acquired work machine. However, the same code runs fine on my home server.
The difference is that on my server I built the netCDF libraries with parallel access enabled (which requires HDF5 compiled with an MPI compiler).
I suspect this feature can prevent the OP's error from happening.
EDIT:
In order to have NetCDF with parallel I/O, first you need to build HDF5 with the following arguments:

./configure --prefix=/opt/software CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx FC=/usr/bin/mpifort

And then, when building the NetCDF C and Fortran libraries, you can also enable tests with the parallel I/O to make sure everything works fine:

# C version
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx

# Fortran version
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc FC=/usr/bin/mpifort F77=/usr/bin/mpifort
Of course, in order to do that you need to have some kind of MPI library (MPICH, OpenMPI) installed on your computer.
I have a set of data (around 50,000 files, each about 1.5 MB). To load and process the data, I first used this code:
data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the text files in the directory
Then I use a for loop to load each file:
for (k in 1:length(listcsv)){
  data[[k]] <- read.csv(listcsv[k], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(as.double(data[[k]][1:57600, 2]))
  # (the computation of ort_my from my is omitted here)
  print(ort_my)
  a[k] <- ort_my
  write(a, file = "D:/ddd/ads.txt", sep = '\t', ncolumns = 1)
}
So I set the program running, but it still hadn't finished after 6 hours, even though I have a decent PC with 32 GB of RAM and a 6-core CPU.
I have searched the forum, and people say the fread function may be helpful. However, all of the examples I have found so far deal with reading a single file with fread.
Can anyone suggest a faster way to read and process data with this many files, rows, and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient. But I think running in parallel could save you a lot of time, and it also saves memory by not storing each file.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
data <- fread(x,skip=37)
my <- as.matrix(data[1:57600,2,with=F]);
mesh <- array(my, dim = c(120,60,8));
Ms<-1350*10^3 # A/m
asd2=(mesh[70:75,24:36 ,2])/Ms; # in A/m
ort_my<- mean(asd2);
return(ort_my)
}
#R Code to run functions in parallel
library(“foreach”);library(“parallel”);library(“doMC”)
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(listcsv,.combine=rbind,.packages=c(”data.table”)) %dopar% (readFiles(x))
registerDoSEQ() #Very important to close out parallel backend.
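Here listcsv is assumed to be the vector of file names built as in your question, e.g.:

listcsv <- dir(pattern = "*.txt")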
I am trying to quit and restart R from within R. The reason for this is that my job takes up a lot of memory, and none of the common options for cleaning R's workspace reclaim RAM taken up by R. gc(), closeAllConnections(), rm(list = ls(all = TRUE)) clear the workspace, but when I examine the processes in the Windows Task Manager, R's usage of RAM remains the same. The memory is reclaimed when R session is restarted.
I have tried the suggestion from this post:
Quit and restart a clean R session from within R?
but it doesn't work on my machine. It closes R, but doesn't open it again. I am running R x64 3.0.2 through RGui (64-bit) on Windows 7. Perhaps it is just a simple tweak of the first line in the above post:
makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)
but I am unsure how it needs to be changed.
Here is the code. It is not fully reproducible, because a large list of files is needed that are read in and scraped. What eats memory is the scrape.func(); everything else is pretty small. In the code, I apply the scrape function to all files in one folder. Eventually, I would like to apply to a set of folders, each with a large number of files (~ 12,000 per folder; 50+ folders). Doing so at present is impossible, since R runs out of memory pretty quickly.
library(XML)
library(R.utils)

## define scraper function
scrape.func <- function(file.name){
  require(XML)

  ## read in (zipped) html file
  txt <- readLines(gunzip(file.name))

  ## parse html
  doc <- htmlTreeParse(txt, useInternalNodes = TRUE)

  ## extract information
  top.data <- xpathSApply(doc, "//td[@valign='top']", xmlValue)
  id <- top.data[which(top.data == "I.D.:") + 1]
  pub.date <- top.data[which(top.data == "Data publicarii:") + 1]
  doc.type <- top.data[which(top.data == "Tipul documentului:") + 1]

  ## tie into dataframe
  df <- data.frame(id, pub.date, doc.type, stringsAsFactors = FALSE)
  return(df)

  # clean up (note: this is never reached, because it comes after return)
  closeAllConnections()
  rm(txt)
  rm(top.data)
  rm(doc)
  gc()
}

## where to store the scraped data
file.create("/extract.top.data.2008.1.csv")

## extract the list of files from the target folder
write(list.files(path = "/2008/01"),
      file = "/list.files.2008.1.txt")

## count the number of files
length.list <- length(readLines("/list.files.2008.1.txt"))
length.list <- length.list - 1

## read in filename by filename and scrape
for (i in 0:length.list){
  ## read in line by line
  line <- scan("/list.files.2008.1.txt", '',
               skip = i, nlines = 1, sep = '\n', quiet = TRUE)

  ## catch the full path
  filename <- paste0("/2008/01/", as.character(line))

  ## scrape
  data <- scrape.func(filename)

  ## append output to results file
  write.table(data, file = "/extract.top.data.2008.1.csv",
              append = TRUE, sep = ",", col.names = FALSE)

  ## rezip the html
  filename2 <- sub(".gz", "", filename)
  gzip(filename2)
}
Many thanks in advance,
Marko
I also did some web scraping and ran directly into the same problem as you, and it drove me crazy. Although I am running a modern OS (Windows 10), the memory is still not released from time to time. After having a look at the R FAQ, I went for CleanMem, where you can set up an automated memory cleaner to run every 5 minutes or so. Be sure to use
rm(list = ls())
gc()
closeAllConnections()
beforehand, so that R releases the memory.
Then let CleanMem run so that the OS notices there is free memory.
Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400 MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win; basic lapply did, but this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
library(rbenchmark)

N <- 100
mc.cores <- detectCores()

benchmark(lapply(readLines(FILE, n = N, warn = FALSE), fromJSON),
          llply(readLines(FILE, n = N, warn = FALSE), fromJSON),
          mclapply(readLines(FILE, n = N, warn = FALSE), fromJSON),
          mclapply(readLines(FILE, n = N, warn = FALSE), fromJSON,
                   mc.cores = mc.cores),
          foreach(x = readLines(FILE, n = N, warn = FALSE)) %do% fromJSON(x),
          replications = 100)
Here is a second code sample
parseData <- function(x) {
  x <- tryCatch(fromJSON(x),
                error = function(e) return(list()))

  ## need to do a test to see if valid data; if so, save out the files
  if (!is.null(x$id_str)) {
    x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
    fname <- paste("rdata/",
                   format(x$created_at, "%m"),
                   format(x$created_at, "%d"),
                   format(x$created_at, "%Y"),
                   "_",
                   x$id_str,
                   sep = "")
    saveRDS(x, fname)
    rm(x, fname)
    gc(verbose = FALSE)
  }
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
Reading in parallel
You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:
result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)
Registering a backend can probably be done with the parallel package, which is now integrated into base R. Alternatively, you can use the doSNOW package; see this post on my blog for details.
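A minimal sketch of that approach, using doParallel as the adapter for the parallel package; readJSON here is a hypothetical helper that parses one chunk file of newline-delimited JSON:

library(rjson)
library(plyr)
library(doParallel)

readJSON <- function(path) {
  # parse one chunk file, one JSON record per line, into a data frame
  recs <- lapply(readLines(path, warn = FALSE), fromJSON)
  ldply(recs, as.data.frame)
}

cl <- makeCluster(4)
registerDoParallel(cl)

result <- ldply(list.files(pattern = "\\.json$"), readJSON,
                .parallel = TRUE,
                .paropts = list(.packages = c("rjson", "plyr")))

stopCluster(cl)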
Processing in parallel
In this scenario, your best bet is to read the entire dataset into a character vector, split the data, and then use a parallel backend combined with, e.g., the plyr functions.
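For example, a sketch that reuses the FILE placeholder from the question and assumes a backend registered as above:

lines <- readLines(FILE, warn = FALSE)   # one serial read of the whole file
parsed <- llply(lines, fromJSON, .parallel = TRUE,
                .paropts = list(.packages = "rjson"))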
Probably not with readLines(), due to the nature of non-parallel file-system IO. Of course, if you're using a parallel NFS or something like HDFS, then this restriction won't apply. But assuming you're on a "standard" architecture, it won't be feasible to parallelize your readLines() calls.
Your best bet would probably be to read in the entire file, seeing as <500 MB will probably fit in memory, and then parallelize the processing once your object has already been read in.
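For instance, a sketch on a Unix-alike that reuses parseData() and FILES from the question:

lines <- readLines(FILES[1], n = -1, warn = FALSE)   # serial read of the whole file
results <- parallel::mclapply(lines, parseData, mc.cores = parallel::detectCores())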