Parallel processing and for loop in R

I have a large number of netCDF files, each of which is a 300*300 = 90,000-cell grid.
I tried to open each file in a loop, reshape its 90,000 grid values into a single column, then open the next file and append it as a new column, and so on. This builds a data frame in which each column represents one netCDF file and has 90,000 rows.
The code is as follows.
files <- list.files("C:/cygwin64/home/Suchi", pattern = "3B-HHR.MS.MRG.3IMERG.2001", full.names = TRUE)
# Loop over files
for(i in 1:files) {
  nc <- ncdf4::nc_open(files[i])
  lw <- ncvar_get(nc, "pcp")
  lw <- as.data.frame(lw)
  lw <- as.data.frame(t(lw))
  lw <- unlist(lw)
  lw <- data.frame(lw)
  # Add the values from each file to a single data.frame
  df <- cbind(df, data.frame(lw))
  ncdf4::nc_close(nc)
}
The above code works fine; it just takes too much time.
Please help me do the same thing using the foreach command with parallel processing.
When I use foreach for parallel processing, I get the following error:
Error unlist(ncdf4::nc_open(files[i])) :
task 1 failed - "missing value where TRUE/FALSE needed"

I don't see your foreach loop, so I have made one for you. The error you are receiving may be because your loop is this:
for(i in 1:files)
which is wrong, since files is a character vector, not a number. It should instead be:
for(i in 1:length(files))
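As a quick illustration (with made-up file names) of why the original bound misbehaves:
files <- c("a.nc", "b.nc")   # hypothetical names, just for illustration
# 1:files                    # errors: these names cannot be coerced to a number
1:length(files)              # 1 2
seq_along(files)             # 1 2, and also behaves sensibly when files is empty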
Here is the foreach loop that I created for your script. Let me know if this works:
library(parallel)
library(doParallel)
library(foreach)

files <- list.files("C:/cygwin64/home/Suchi",
                    pattern = "3B-HHR.MS.MRG.3IMERG.2001",
                    full.names = TRUE)

cl <- makeCluster(10)
registerDoParallel(cl)

# Loop over files; each iteration returns one column,
# and .combine = cbind assembles them into a single data frame
df <- foreach(i = 1:length(files), .combine = cbind,
              .packages = "ncdf4") %dopar% {
  nc <- ncdf4::nc_open(files[i])
  lw <- ncvar_get(nc, "pcp")
  lw <- as.data.frame(lw)
  lw <- as.data.frame(t(lw))
  lw <- unlist(lw)
  lw <- data.frame(lw)
  ncdf4::nc_close(nc)
  lw   # the value returned to foreach for this file
}

stopCluster(cl)
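Because each worker runs in its own process, an assignment like cbind(df, lw) -> df inside %dopar% would be lost; instead each iteration returns lw and .combine = cbind assembles the columns. As a small follow-up (assuming the one-column-per-file layout from the question), the result can then be labelled with the source file names:
names(df) <- basename(files)   # one column per netCDF file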

Related

problem with loops: creating names based on increasing values for reading in CSV file

Okay, so I needed to split a larger file into a bunch of CSVs to run through a non-R program. I used this loop to do it:
for(k in 1:14){
  inner_edge = 2000000L*(k-1) + 1
  outer_edge = 2000000L*(k)
  part <- slice(nc_tu_CLEAN, inner_edge:outer_edge)
  out_name = paste0("geo/geo_CLEAN", (k), ".csv")
  write_csv(part, out_name)
  Sys.time()
}
which worked great. Except now I'm having a problem in another program and need to read a bunch of these back in to troubleshoot. I tried to write this loop for it, and get the following error:
for(k in 1:6){
  csv_name <- paste0("geo_CLEAN", (k), ".csv")
  geo_CLEAN_(k) <- fread(file = csv_name)
}
|--------------------------------------------------|
|==================================================|
Error in geo_CLEAN_(k) <- fread(file = csv_name) :
could not find function "geo_CLEAN_<-"
I know I could do this line by line, but I'd like this to be a loop if possible. What I want is for geo_CLEAN_1 to hold the result of fread on geo_CLEAN1.csv, geo_CLEAN_2 the result for geo_CLEAN2.csv, and so on.
We need assign if we are interested in creating objects from character strings:
for(k in 1:6){
  csv_name <- paste0("geo_CLEAN", (k), ".csv")
  assign(sub("\\.csv", "", csv_name), fread(file = csv_name))
}

R foreach do parallel does not end after loop completion

My goal is to do some operations on a data frame, created as follows:
exp_info <- data.frame(location.Id = 1:1e7,
                       x = rnorm(10))
For each location, I want to square the x variable and write an individual csv file. My actual computation is lengthier and involves other steps, so this is a simplified example. This is how I am parallelising my task:
library(doParallel)
myClusters <- parallel::makeCluster(6)
doParallel::registerDoParallel(myClusters)
foreach(i = 1:nrow(exp_info),
        .packages = c("dplyr", "data.table"),
        .errorhandling = 'remove',
        .verbose = TRUE) %dopar% {
  rowRef <- exp_info[i, ]
  rowRef <- rowRef %>% dplyr::mutate(x.sq = x^2)
  fwrite(rowRef, paste0(i, '_iteration.csv'))
}
When I look at my working directory, I see all the individual csv files (1e7 of them) written out, which suggests the code ran successfully. However, my foreach loop does not end even after all the files are written, and I have to kill the job, which also does not produce any error. Does anyone have any idea why this could happen?
I'm experiencing something similar. I don't know the answer, but I will add this: the same code and operation works on one computer but fails to exit the foreach loop on another computer. Hope this provides some direction.

R: using foreach to read csv data and apply functions over the data and export back to csv

I have 3 csv files, namely file1.csv, file2.csv and file3.csv.
Now, for each file, I would like to import the csv, perform some functions on it, and then export a transformed csv. So: 3 csv files in and 3 transformed csv files out, and these are 3 independent tasks, so I thought I could use foreach with %dopar%. Please note that I am using a Windows machine.
However, I cannot get this to work.
library(foreach)
library(doParallel)
library(xts)
library(zoo)
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
filenames <- c("file1.csv","file2.csv","file3.csv")
foreach(i = 1:3, .packages = c("xts", "zoo")) %dopar% {
  df_xts <- data_processing_IMPORT(filenames[i])
  ddates <- unique(date(df_xts))
}
If I comment out the last line, ddates <- unique(date(df_xts)), the code runs fine with no error.
However, if I include that line, I receive the error below, which I have no idea how to get around. I tried adding .export = c("df_xts").
Error in { : task 1 failed - "unused argument (df_xts)"
It still doesn't work. I want to understand what is wrong with my logic and how I should get around it. I am just trying to apply simple functions to the data; I still haven't transformed the data or exported it to csv, yet I am already stuck.
The funny thing is that the simple code below works fine. Within the foreach, a plays the same role as df_xts above: it is stored in a variable and passed into Fun2 for processing. The code below works, but the code above doesn't, and I don't understand why.
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
# Define the function
Fun1=function(x){
a=2*x
b=3*x
c=a+b
return(c)
}
Fun2=function(x){
a=2*x
b=3*x
c=a+b
return(c)
}
foreach(i = 1:10)%dopar%{
x <- rnorm(5)
a <- Fun1(x)
tst <- Fun2(a)
return(tst)
}
### Output: No error
parallel::stopCluster(cl)
Update: I have found that the issue is with the date function, which I use to extract the dates within the csv file, but I am not sure how to get around it.
The use of foreach() is correct. You are using date() in ddates <- unique(date(df_xts)), but that function returns the current system date and time as a character string and does not take any arguments, so the "unused argument" error is about the date() function.
So I guess you want to use as.Date() instead, or something similar:
ddates <- unique(as.Date(df_xts))
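A minimal sketch of the difference, using a made-up xts series (inside the workers only xts and zoo are attached, so date() most likely resolves to base::date(), which takes no arguments):
library(xts)
x <- xts(1:3, order.by = as.POSIXct(c("2020-01-01 10:00",
                                      "2020-01-01 12:00",
                                      "2020-01-02 09:00")))
base::date()               # current system date and time as a string; no arguments
# base::date(x)            # fails with "unused argument (x)"
unique(as.Date(index(x)))  # the unique calendar dates in the series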
I've run into the same issue of reading, modifying and writing several CSV files. I tried to find a tidyverse solution for this, and while it doesn't really deal with the date problem above, here it is: how to read, modify and write several csv files using map from purrr.
library(tidyverse)
# There are some sample csv file in the "sample" dir.
# First get the paths of those.
datapath <- fs::dir_ls("./sample", regexp = ("csv"))
datapath
# Then read in the data, such as it is a list of data frames
# It seems simpler to write them back to disk as separate files.
# Another way to read them would be:
# newsampledata <- vroom::vroom(datapath, ";", id = "path")
# but this will return a DF and separating it to different files
# may be more complicated.
sampledata <- map(datapath, ~ read_delim(.x, ";"))
# Do some transformation of the data.
# Here I just alter the column names.
transformeddata <- sampledata %>%
map(rename_all, tolower)
# Then prepare to write new files
names(transformeddata) <- paste0("new-", basename(names(transformeddata)))
# Write the csv files and check if they are there
map2(transformeddata, names(transformeddata), ~ write.csv(.x, file = .y))
dir(pattern = "new-")

Error in { : task 1 failed - "error returned from C call" using ncvar_get (ncdf4 package) within foreach loop

I am trying to extract data from a .nc file. Since there are 7 variables in my file, I want to loop the ncvar_get function through all 7 using foreach.
Here is my code:
# EXTRACTING CLIMATE DATA FROM NETCDF4 FILE
library(dplyr)
library(data.table)
library(lubridate)
library(ncdf4)
library(parallel)
library(foreach)
library(doParallel)
# SET WORKING DIRECTORY
setwd('/storage/hpc/data/htnb4d/RIPS/UW_climate_data/')
# SETTING UP
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
# READING INPUT FILE
infile <- nc_open("force_SERC_8th.1979_2016.nc")
vars <- attributes(infile$var)$names
climvars <- vars[1:7]
# EXTRACTING INFORMATION OF STUDY DOMAIN:
tab <- read.csv('SDGridArea.csv', header = T)
point <- sort(unique(tab$PointID)) #6013 points in the study area
# EXTRACTING DATA (P, TMAX, TMIN, LW, SW AND RH):
clusterEvalQ(cl, {
library(ncdf4)
})
clusterExport(cl, c('infile','climvars','point'))
foreach(i = climvars) %dopar% {
  climvar <- ncvar_get(infile, varid = i) # all 13650 data points
  dim <- dim(climvar)
  climMX <- aperm(climvar, c(3, 2, 1))
  dim(climMX) <- c(dim[3], dim[1]*dim[2])
  climdt <- data.frame(climMX[, point]) # keep the 6013 points in the study area
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = F)
}
stopCluster(cl)
And the error is:
Error in { : task 1 failed - "error returned from C call"
Calls: %dopar% -> <Anonymous>
Execution halted
Could you please explain what is wrong with this code? I assume it has something to do with the fact that the cluster couldn't find out which variable to get from the file, since 'error returned from C call' usually comes from ncvar_get varid argument.
I had the same problem (identical error message) running a similar R script on my MacBook Pro (OSX 10.12.5). The problem seems to be that the different workers from the foreach loop try to access the same .nc file at the same time with ncvar_get. This can be solved by using ncvar_get outside the foreach loop (storing all the data in a big array) and accessing that array from within the foreach loop.
Obviously, another solution would be to split up the .nc file appropriately beforehand and then access the different .nc files from within the foreach loop. This should lower memory consumption, since copying the big array to each worker is avoided.
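A rough sketch of the first suggestion (read everything serially, then parallelise the processing), reusing infile, climvars and point from the question above; those names and the file path are taken from the posted code, not verified here:
library(ncdf4)
library(parallel)
library(foreach)
library(doParallel)

infile <- nc_open("force_SERC_8th.1979_2016.nc")
# Read every variable serially, before any worker touches the file
allvars <- lapply(climvars, function(v) ncvar_get(infile, varid = v))
names(allvars) <- climvars
nc_close(infile)

cl <- makeCluster(length(climvars))
registerDoParallel(cl)
# Note: allvars is copied to every worker, which is what the second suggestion avoids
clusterExport(cl, c('allvars', 'point'))
foreach(i = climvars) %dopar% {
  climvar <- allvars[[i]]
  dim <- dim(climvar)
  climMX <- aperm(climvar, c(3, 2, 1))
  dim(climMX) <- c(dim[3], dim[1]*dim[2])
  climdt <- data.frame(climMX[, point])
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = FALSE)
}
stopCluster(cl)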
I had the same issue on a recently acquired work machine. However, the same code runs fine on my home server.
The difference is that on my server I built the netCDF libraries with parallel access enabled (which requires HDF5 compiled with an MPI compiler).
I suspect this feature prevents the OP's error from happening.
EDIT:
In order to have NetCDF with parallel I/O, first you need to build HDF5 with the following arguments:
./configure --prefix=/opt/software CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx FC=/usr/bin/mpifort
And then, when building the NetCDF C and Fortran libraries, you can also enable tests with the parallel I/O to make sure everything works fine:
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx (C version)
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc FC=/usr/bin/mpifort F77=/usr/bin/mpifort (Fortran version)
Of course, in order to do that you need to have some kind of MPI library (MPICH, OpenMPI) installed on your computer.

Make function and apply to read data in R?

I have a set of data files (around 50,000 of them, each about 1.5 MB), so to load and process the data I first used this code:
data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the data files in the directory
Then I use a for loop to load each file:
for (k in 1:length(listcsv)){
  data[[k]] <- read.csv(listcsv[k], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(as.double(data[[k]][1:57600, 2]))
  # (ort_my is computed from my in steps not shown here; see the answer below)
  print(ort_my)
  a[k] <- ort_my
  write(a, file = "D:/ddd/ads.txt", sep = '\t', ncolumns = 1)
}
So I set the program running, but even after 6 hours it had not finished, although I have a decent PC with 32 GB of RAM and a 6-core CPU.
I searched the forum, and people say the fread function might be helpful. However, all the examples I have found so far deal with reading a single file with fread.
Can anyone suggest a faster way to loop over, read, and process data with this many rows and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient, but I think running in parallel could save you a bunch of time, and it will also save memory by not storing each file.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
data <- fread(x,skip=37)
my <- as.matrix(data[1:57600,2,with=F]);
mesh <- array(my, dim = c(120,60,8));
Ms<-1350*10^3 # A/m
asd2=(mesh[70:75,24:36 ,2])/Ms; # in A/m
ort_my<- mean(asd2);
return(ort_my)
}
#R Code to run functions in parallel
library(“foreach”);library(“parallel”);library(“doMC”)
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(listcsv,.combine=rbind,.packages=c(”data.table”)) %dopar% (readFiles(x))
registerDoSEQ() #Very important to close out parallel backend.
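One caveat: the output path D:/ddd/ads.txt in the question suggests a Windows machine, and doMC relies on forking, which is not available on Windows. If that is the case, the doParallel backend used earlier in this thread is a drop-in alternative (a sketch, assuming the readFiles function and listcsv vector defined above):
library(doParallel)
cl <- makeCluster(8)   # pick a core count that suits your machine
registerDoParallel(cl)
OutputList <- foreach(x = listcsv, .combine = rbind,
                      .packages = c("data.table")) %dopar% readFiles(x)
stopCluster(cl)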
