Skip error in lapply and continue processing ncdf4 files in R

I submitted an R script to a Linux HPC using the "sub" script. I have written a function in R to apply to a list. However, it stops running once it encounters a bad file. How do I write the R function in such a way that it skips the error and continues on the good netCDF files? The script:
## Required packages
library(ncdf4)
library(raster)
## List files in the SEVIRI data folder
LST1 <- list.files(pattern="GT_SSD.*\\.nc", recursive=TRUE, path="/data atsr/SEVIRI/2007")
## Function to create rasters
fun2 <- function(x){
  ## Open the file
  y1 <- nc_open(x)
  ## Get the LST variable
  y2 <- ncvar_get(y1, "LST")
  nc_close(y1)
  y3 <- t(y2)
  R1 <- raster(y3, xmn=-80, xmx=80, ymn=-42, ymx=80)
  proj4string(R1) <- CRS("+proj=longlat +ellps=WGS84")
  frm <- extent(c(-19, 19, 2, 29))
  pfrm <- as(frm, 'SpatialPolygons')
  R3 <- crop(R1, pfrm)
}
When I apply the function
LST2<-lapply(LST1,fun2)
The error message is:
Error in nc_open(x) :
Error in nc_open trying to open file GT_SEV_2P/GT_SSD-L2-SEVIR_LST_2-20110122_010000-LIPM-0.05X0.05-V1.0.nc
The script stops running once this happens. How do I ensure it keeps running on the good ones, please? The code above is just the first part of the script.

Here is an example with try. Note that I simplified your function considerably. I cannot be sure about this, as I do not have your data, but this much more direct approach works in most cases. You certainly do not need to create a SpatialPolygons object for use in crop.
fun2 <- function(x) {
  R1 <- try(raster(x, varname="LST"), silent=TRUE)
  if (inherits(R1, 'try-error')) {
    return(NA)
  }
  frm <- extent(c(-19, 19, 2, 29))
  crop(R1, frm)
}
x <- lapply(LST1, fun2)
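Any file that fails comes back as NA rather than a RasterLayer, so you can drop the failures afterwards before further processing; a small follow-up sketch:
ok <- sapply(x, inherits, what = "RasterLayer")
sum(!ok)  # how many files failed
x <- x[ok]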

Related

Understanding writeValues of raster in parallel processing: is it possible to writeValues for each raster while using an mclapply fork cluster in R?

I am trying to understand how to parallelize raster processing in R. My goal is to parallelize the following on multiple cores with multiple rasters.
I process my raster blockwise and I try to parallelize this with mclapply or other functions. First I want to get the values of one raster or a raster stack, and then I want to write the values to the object. When I use multiple cores, it does not work, because different sub-processes want to write at the same time. Does somebody know a solution for that?
So here is the process:
Get and create the data:
library(raster)
r <- raster(system.file("external/test.grd", package="raster"))
s <- raster(r)
tr <- blockSize(r)
Then getValues and writeValues with a for loop:
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(r, row=tr$row[i], nrows=tr$nrows[i])
s <- writeValues(s, v, tr$row[i])
}
s <- writeStop(s)
This works fine.
Now trying the same with lapply:
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#working with lapply
lapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
})
s <- writeStop(s)
This works fine.
Now trying mclapply with one core:
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#does work with mclapply one core
parallel::mclapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
}, mc.cores = 1)
s <- writeStop(s)
This also works.
Now trying mclapply on multiple cores:
s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
#does not work with multiple core
parallel::mclapply(1:tr$n, function(x){
v <- getValues(r, tr$row[x], tr$nrows[x])
s <- writeValues(s,v,tr$row[x])
}, mc.cores = 2)
s <- writeStop(s)
So that does not work. I understand the logic of why it does not work.
My question now is: suppose I have a raster stack with 2 rasters. Could I use mclapply or another function from the parallel package to write this process differently, so that I get the values of the block for both grids at the same time, but the values are written to only one raster per core?
For the solution I am looking for, it is not acceptable to first get all the values, save them in an object and then write the values blockwise, because my rasters are too large.
I would be very happy if someone has a solution or just an idea or suggestion.
Thanks.
I believe the object returned by raster::writeStart() can only be processed in the same R process in which it was created. That is, it is not possible for a parallel R process to work with it.
The fact that the object uses an external pointer internally is a strong indicator that it cannot be exported to another R process, or saved to file and read back again. You can check for external pointers using the (non-public) future:::assert_no_references(), e.g.
> library(raster)
> r <- raster(system.file("external/test.grd", package="raster"))
> future:::assert_no_references(r)
NULL ## == no external pointer
> s <- raster(r)
> future:::assert_no_references(s)
NULL ## == no external pointer
> s <- writeStart(s[[1]], filename='test.grd', overwrite=TRUE)
> future:::assert_no_references(s)
Error: Detected a non-exportable reference ('externalptr') in one of the globals (<unknown>) used in the future expression
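Given that constraint, a pattern that should still work (a sketch on my part, not a tested solution) is to parallelize only the reading: let the forked workers return each block's values, and keep every writeValues() call in the parent process where the writeStart() object lives. Note that this buffers the block values in the parent, so for very large rasters you would have to do it in batches of blocks:
library(raster)
library(parallel)
r <- raster(system.file("external/test.grd", package="raster"))
s <- raster(r)
tr <- blockSize(r)
# Workers only read from r; each returns the values of one block
vals <- mclapply(1:tr$n, function(i) {
  getValues(r, tr$row[i], tr$nrows[i])
}, mc.cores = 2)
# Writing stays sequential, in the parent process
s <- writeStart(s, filename='test.grd', overwrite=TRUE)
for (i in 1:tr$n) {
  s <- writeValues(s, vals[[i]], tr$row[i])
}
s <- writeStop(s)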

R: using foreach to read csv data and apply functions over the data and export back to csv

I have 3 csv files, namely file1.csv, file2.csv and file3.csv.
Now for each of the files, I would like to import the csv, perform some functions over it, and then export a transformed csv. So, 3 csv in and 3 transformed csv out, and these are just 3 independent tasks. So I thought I could try to use foreach %dopar%. Please note that I am using a Windows machine.
However, I cannot get this to work.
library(foreach)
library(doParallel)
library(xts)
library(zoo)
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
filenames <- c("file1.csv","file2.csv","file3.csv")
foreach(i = 1:3, .packages = c("xts","zoo")) %dopar% {
  df_xts <- data_processing_IMPORT(filenames[i])
  ddates <- unique(date(df_xts))
}
If I comment out the last line, ddates <- unique(date(df_xts)), the code runs fine with no error.
However, if I include that line, I receive the following error, which I have no idea how to get around. I tried adding .export = c("df_xts").
Error in { : task 1 failed - "unused argument (df_xts)"
It still doesn't work. I want to understand what is wrong with my logic and how I should get around it. I am just trying to apply simple functions over the data; I have not even transformed the data and exported it to csv yet, and I am already stuck.
The funny thing is that I have written the simple code below, which works fine. Within the foreach, a is just like the df_xts above: stored in a variable and passed into Fun2 for processing. The code below works, but the one above doesn't, and I don't understand why.
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
# Define two toy functions
Fun1 = function(x){
  a = 2*x
  b = 3*x
  c = a+b
  return(c)
}
Fun2 = function(x){
  a = 2*x
  b = 3*x
  c = a+b
  return(c)
}
foreach(i = 1:10) %dopar% {
  x <- rnorm(5)
  a <- Fun1(x)
  tst <- Fun2(a)
  return(tst)
}
### Output: No error
parallel::stopCluster(cl)
Update: I have found out that the issue is with the date function used to extract the dates from the csv file, but I am not sure how to get around this.
The use of foreach() is correct. You are using date() in ddates <- unique(date(df_xts)), but because lubridate is not listed in .packages, the workers resolve this to base::date(), which returns the current system time and does not take any arguments. Therefore the "unused argument" error refers to the date() function.
So I guess you want to use as.Date() instead, or something similar:
ddates <- unique(as.Date(df_xts))
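Alternatively, if lubridate::date() (which does extract the date component of a date-time) was the intended function, the fix may simply be to list lubridate in .packages so the workers do not fall back to base::date(). A guess on my part, since I cannot run data_processing_IMPORT:
foreach(i = 1:3, .packages = c("xts", "zoo", "lubridate")) %dopar% {
  df_xts <- data_processing_IMPORT(filenames[i])
  ddates <- unique(date(df_xts))
}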
I've run into the same issue of reading, modifying and writing several CSV files. I tried to find a tidyverse solution for this, and while it doesn't really deal with the date problem above, here it is: how to read, modify and write several csv files using map from purrr.
library(tidyverse)
# There are some sample csv file in the "sample" dir.
# First get the paths of those.
datapath <- fs::dir_ls("./sample", regexp = ("csv"))
datapath
# Then read in the data; the result is a list of data frames.
# It seems simpler to write them back to disk as separate files.
# Another way to read them would be:
# newsampledata <- vroom::vroom(datapath, ";", id = "path")
# but this will return a DF and separating it to different files
# may be more complicated.
sampledata <- map(datapath, ~ read_delim(.x, ";"))
# Do some transformation of the data.
# Here I just alter the column names.
transformeddata <- sampledata %>%
  map(rename_all, tolower)
# Then prepare to write new files
names(transformeddata) <- paste0("new-", basename(names(transformeddata)))
# Write the csv files and check if they are there
map2(transformeddata, names(transformeddata), ~ write.csv(.x, file = .y))
dir(pattern = "new-")

Error in if (xn == xx) { : missing value where TRUE/FALSE needed

I'm trying to combine a large number of raster tiles into a single mosaic using the R code below. The error that appears is:
Error in if (xn == xx) { : missing value where TRUE/FALSE needed
The error appears after the for loop.
I would highly appreciate your suggestions.
require(raster)
rasters1 <- list.files("D:/lidar_grid_metrics/ElevMax",
                       pattern="*.asc$", full.names=TRUE, recursive=TRUE)
rast.list <- list()
for(i in 1:length(rasters1)) {
  rast.list[i] <- raster(rasters1[i])
}
rast.list$fun <- mean
rast.mosaic <- do.call(mosaic,rast.list)
plot(rast.mosaic)
First, a better way to write what you do (use lapply):
library(raster)
ff <- list.files("D:/lidar_grid_metrics/ElevMax",
                 pattern="\\.asc$", full.names=TRUE, recursive=TRUE)
rast.list <- lapply(ff, raster)
rast.list$fun <- mean
rast.mosaic <- do.call(mosaic,rast.list)
Now, to the error you get. It is useful to show the results of traceback() after the error occurs. But from the error message, I infer that one of the RasterLayers has an extent with an NA value. That makes it invalid. You can check whether that is true (and, if so, figure out what is going on) by doing:
t(sapply(rast.list, function(i) as.vector(extent(i))))
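If that check does reveal an NA, one option (my suggestion, not part of the original answer) is to drop the offending layers before appending the fun element and calling mosaic; run the check on the plain list of layers:
rast.list <- lapply(ff, raster)
ext <- t(sapply(rast.list, function(i) as.vector(extent(i))))
rast.list <- rast.list[complete.cases(ext)]
rast.list$fun <- mean
rast.mosaic <- do.call(mosaic, rast.list)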
EDIT
With the files Ram sent me, I figured out what was going on. There was a bug when creating a RasterLayer from an ascii file with the native driver if the file specifies "xllcenter" rather than "xllcorner".
This is now fixed in the development version (2.9-1) available on GitHub.
The problem can also be avoided by installing rgdal, because if rgdal is available, the native driver is not used.

Error in { : task 1 failed - "error returned from C call" using ncvar_get (ncdf4 package) within foreach loop

I am trying to extract data from a .nc file. Since there are 7 variables in my file, I want to loop the ncvar_get function through all 7 using foreach.
Here is my code:
# EXTRACTING CLIMATE DATA FROM NETCDF4 FILE
library(dplyr)
library(data.table)
library(lubridate)
library(ncdf4)
library(parallel)
library(foreach)
library(doParallel)
# SET WORKING DIRECTORY
setwd('/storage/hpc/data/htnb4d/RIPS/UW_climate_data/')
# SETTING UP
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
# READING INPUT FILE
infile <- nc_open("force_SERC_8th.1979_2016.nc")
vars <- attributes(infile$var)$names
climvars <- vars[1:7]
# EXTRACTING INFORMATION OF STUDY DOMAIN:
tab <- read.csv('SDGridArea.csv', header = T)
point <- sort(unique(tab$PointID)) #6013 points in the study area
# EXTRACTING DATA (P, TMAX, TMIN, LW, SW AND RH):
clusterEvalQ(cl, {
  library(ncdf4)
})
clusterExport(cl, c('infile','climvars','point'))
foreach(i = climvars) %dopar% {
  climvar <- ncvar_get(infile, varid = i) # all data points (13650 points)
  dim <- dim(climvar)
  climMX <- aperm(climvar, c(3,2,1))
  dim(climMX) <- c(dim[3], dim[1]*dim[2])
  climdt <- data.frame(climMX[,point]) # keep the 6013 points in the study area
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = FALSE)
}
stopCluster(cl)
And the error is:
Error in { : task 1 failed - "error returned from C call"
Calls: %dopar% -> <Anonymous>
Execution halted
Could you please explain what is wrong with this code? I assume it has something to do with the cluster not being able to figure out which variable to get from the file, since "error returned from C call" usually comes from the varid argument of ncvar_get.
I had the same problem (identical error message) running a similar R script on my MacBook Pro (OSX 10.12.5). The problem seems to be that the different workers from the foreach loop try to access the same .nc file at the same time with ncvar_get. This can be solved by using ncvar_get outside the foreach loop (storing all the data in a big array) and accessing that array from within the foreach loop.
Obviously, another solution would be to split up the .nc file appropriately beforehand and then access the different .nc files from within the foreach loop. This should lower memory consumption, since copying the big array to each worker is avoided.
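A minimal sketch of the first suggestion, reusing the objects from the question (infile, climvars, point): read every variable once in the main process, then let the workers operate only on plain R arrays, which can be exported safely:
# Read all variables up front, in the master process
clim_list <- lapply(climvars, function(v) ncvar_get(infile, varid = v))
names(clim_list) <- climvars
nc_close(infile)
foreach(i = climvars) %dopar% {
  climvar <- clim_list[[i]]
  d <- dim(climvar)
  climMX <- aperm(climvar, c(3,2,1))
  dim(climMX) <- c(d[3], d[1]*d[2])
  climdt <- data.frame(climMX[,point])
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = FALSE)
}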
I had the same issue on a recently acquired work machine. However, the same code runs fine on my home server.
The difference is that on my server I built the netCDF libraries with parallel access enabled (which requires HDF5 compiled with an MPI compiler).
I suspect this feature can prevent the OP's error from happening.
EDIT:
In order to have NetCDF with parallel I/O, first you need to build HDF5 with the following arguments:
./configure --prefix=/opt/software CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx FC=/usr/bin/mpifort
And then, when building the NetCDF C and Fortran libraries, you can also enable tests with the parallel I/O to make sure everything works fine:
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx (C version)
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc FC=/usr/bin/mpifort F77=/usr/bin/mpifort (Fortran version)
Of course, in order to do that you need to have some kind of MPI library (MPICH, OpenMPI) installed on your computer.

How to add multiple .nc files to a rasterStack in R

I am trying to create a large rasterStack in R. I have 255 .nc files in a directory. So far I have the following code:
files <- list.files(pattern = "*.nc")
st <- stack()
for (i in 1:length(files)) {
  r <- raster(files[i], level = 1, crs = newproj, varname = "SWE")
  st <- addLayer(r)
}
When I run the code outside of a for loop with only one file, it works fine, but when I run it with the for loop (trying to add every file to the stack), I get this error:
Error in sapply(x, fromDisk) & sapply(x, inMemory) :
operations are possible only for numeric, logical or complex types
If someone could explain the error to me and where I am going wrong, that would be awesome!
Try this: replace st <- addLayer(r) with st <- addLayer(st, r).
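For completeness, the loop with that one-line fix applied (a sketch; newproj is the asker's projection object, not defined in the question):
library(raster)
files <- list.files(pattern = "\\.nc$")
st <- stack()
for (i in 1:length(files)) {
  r <- raster(files[i], level = 1, crs = newproj, varname = "SWE")
  st <- addLayer(st, r)
}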
