How to efficiently parallelize EXTRACT function in raster package R

How to efficiently parallelize EXTRACT function in raster package R - r

Given a netcdf file, I am trying to extract all pixels to form a data.frame for later export to .csv
a=brick(mew.nc)
#get coordinates
coord<-xyFromCell(a,1:ncell(a))
I can extract data for all pixels using extract(a,1:ncell(a)). However, I run into memory issues.
Upon reading through various help pages, I found that one can speed up things with:
beginCluster(n=30)
b=extract(a, coord)
endCluster()
But I still run out of memory. Our supercomputer has more than 1000 nodes, each node has 32 cores.
My actual rasterbrick has 400,000 layers
I am not sure how to parrallize this task without running into memory issues.
Thank you for all your suggestions.
Sample data of ~8MB can be found here

You can do something along these lines to avoid memory problems
library(raster)
b <- brick(system.file("external/rlogo.grd", package="raster"))
outfile <- 'out.csv'
if (file.exists(outfile)) file.remove(outfile)
tr <- blockSize(b)
b <- readStart(b)
for (i in 1:tr$n) {
v <- getValues(b, row=tr$row[i], nrows=tr$nrows[i])
write.table(v, outfile, sep = ",", row.names = FALSE, append = TRUE, col.names=!file.exists(outfile))
}
b <- readStop(b)
To parallelize, you could do this by layer, or groups of layers; and probably all values in one step for each subset of layers. Here for one layer at a time:
f <- function(d) {
filename <- extension(paste(names(d), collapse='-'), '.csv')
x <- values(d)
x <- matrix(x) # these two lines only needed when using
colnames(x) <- names(d) # a single layer
write.csv(x, filename, row.names=FALSE)
}
# parallelize this:
for (i in 1:nlayers(b)) {
f(b[[i]])
}
or
x <- sapply(1:nlayers(b), function(i) f(b[[i]]))
You should not be using extract. The question I have is what you would want such a large csv file for.

Related

Is there a faster way to mask multiple rasters using multiple polygons?

Basically I have 12 multispectral images, and I want to mask them using 2 polygons (small waterbodies). The 2 polygons are in one shapefile, but I can break them up if it would make the process easier. With the help of some nice users on here, I tested this all out using the 12 images on one polygon and it works just fine, but I'll eventually need to do this for multiple polygons so I want to adapt my code.
The loop to crop all rasters using a single polygon:
#The single polygon
mask <- st_read(here::here("data", "mask.shp") %>%
st_as_sf()
#Creates list of input files and their paths
crop_in <- list.files(here::here("data", "s2_rasters"), pattern="tif$", full.names=TRUE)
#Creates list of output files and their directory.
crop_out <- gsub(here::here("data", "s2_rasters"), here::here("data", "s2_cropped"), crop_in)
for (i in seq_along(crop_in)) {
b <- brick(crop_in[i])
crop(b, mask, filename = crop_out[i])
}
Like I said this works just fine, but I want to mask instead of crop. Additionally, I need to mask using multiple polygons.
My working loop to do the same thing but for multiple (2) polygons:
masks_2 <- st_read(here::here("data", "multiple_masks.shp")) %>%
st_as_sf()
for (i in seq_along(crop_in)) {
b <- brick(crop_in[i])
mask(b, masks_2, filename = crop_out[i], overwrite = TRUE)
}
This took around 2 hours (which makes me suspicious) and I think it lost the polygon id somewhere along the way. When I tried plotting the results the plot was empty. My final output should be 24 rasterstacks, 12 for each polygon. I will need to do further image analysis so I will need to keep the names. I hope this makes sense and thank you!

Here is a minimal, self-contained, reproducible example using terra because it is much faster than raster (make sure you are using the current version)
Raster dataset with 12 layers
library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
r <- rep(r, 12) * 1:12
names(r) <- paste0("band", 1:12)
Two "lakes"
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,12)]
Solution:
x <- mask(r, v)
And always try things for a single case before running the loop.
So if you have 12 files, you can do something like
inf <- list.files("data/s2_rasters", pattern="tif$", full.names=TRUE)
outf <- gsub(".tif$", "_masked.tif", inf)
for (for i in 1:length(inf)) {
r <- rast(inf[i])
m <- mask(r, v, filename=outf[i])
}
It might be a little faster to instead do this (only rasterize the polygons once)
msk <- rast(inf[1])
msk <- rasterize(v, msk)
for (for i in 1:length(inf)) {
r <- rast(inf[i])
m <- mask(r, msk, filename=outf[i])
}
Or make one object/file, if that is practical.
rr <- rast(inf)
mm <- mask(rr, v)

Dealing with big datasets in R

I'm having a memory problem with R giving the Can not allocate vector of size XX Gb error message. I have a bunch of daily files (12784 days) in netcdf format giving sea surface temperature in a 1305x378 (longitude-latitude) grid. That gives 493290 points each day, decreasing to about 245000 when removing NAs (over land points).
My final objective is to build a time series for any of the 245000 points from the daily files and find the temporal trend for each point. And my idea was to build a big data frame with a point per row and a day per column (2450000x12784) so I could apply the trend calculation to any point. But then, building such data frame, the memory problem appeared, as expected.
First I tried a script I had previously used to read data and extract a three column (lon-lat-sst) dataframe by reading nc file and then melting the data. This lead to an excessive computing time when tried for a small set of days and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem but the csv output files were too big and the process was very time consuming.
Another strategy I've tried without success to the moment it's been to sequentially read all the nc files and then extract all the daily values for each point and find the trend. Then I would only need to save a single 245000 points dataframe. But I think this would be time consuming and not the proper R way.
I have been reading about big.memory and ff packages to try to declare big.matrix or a 3D array (1305 x 378 x 12784) but had not success by now.
What would be the appropriate strategy to face the problem?
Extract single point time series to calculate individual trends and populate a smaller dataframe
Subset daily files in slices to avoid the memory problem but end with a lot of dataframes/files
Try to solve the memory problem with bigmemory or ff packages
Thanks in advance for your help
EDIT 1
Add code to fill the matrix
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)
# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)
sst_data_full <- function(inputfile) {
sstFile <- nc_open(inputfile)
sst_read <- list()
sst_read$lon <- ncvar_get(sstFile, "lon")
sst_read$lats <- ncvar_get(sstFile, "lat")
sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
nc_close(sstFile)
sst_read
}
melt_sst <- function(L) {
dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
sst_read <- melt(L$sst, value.name = "sst")
}
# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")
sst.out=data.frame()
for (i in 1:length(files) ) {
sst<-sst_data_full(paste0(ruta_datos,files[i],sep=""))
msst <- melt_sst(sst)
msst<-subset(msst, !is.na(msst$sst))
if ( i == 1 ) {
sst.out<-msst
} else {
sst.out<-cbind(sst.out,msst$sst)
}
}
EDIT 2
Code used in a previous (smaller) data frame to calculate temporal trend. Original data was a matrix of temporal series, being each column a series.
library(forecast)
data<-read.csv(....)
for (i in 2:length(data)){
var<-paste("V",i,sep="")
ff<-data$fecha
valor<-data[,i]
datos2<-as.data.frame(cbind(data$fecha,valor))
datos.ts<-ts(datos2$valor, frequency = 365)
datos.stl <- stl(datos.ts,s.window = 365)
datos.tslm<-tslm(datos.ts ~ trend)
summary(datos.tslm)
output[i-1]<-datos.tslm$coefficients[2]
}
fecha is date variable name
EDIT 2
Working code from F. Privé answer
library(bigmemory)
tmp <- sst_data_full(paste0(ruta_datos,files[1],sep=""))
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files),backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL" )
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(paste0(ruta_datos,files[i],sep=""))$sst
}
With this code a big matrix was created
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290

So, to your read data in a Filebacked Big Matrix (FBM), you can do
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
full.names = TRUE)
tmp <- sst_data_full(files[1])
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(files[i])$sst
}

reading and processing files in parallel in R

I am using the parallel library in R to process a large data set on which I am applying complex operations.
For the sake of providing a reproducible code, you can find below a simpler example:
#data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for(i in 1:800)
{
my.matrix <- matrix(runif(100),ncol=10,nrow=10)
saveRDS(my.matrix,file=paste0(dir,"/matrix",i))
}
#worker function
worker.function <- function(files)
{
files.length <- length(files)
partial.results <- vector('list',files.length)
for(i in 1:files.length)
{
matrix <- readRDS(files[i])
partial.results[[i]] <- sum(diag(matrix))
}
Reduce('+',partial.results)
}
#master part
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)
part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])
results <- clusterApply(cl,files.partitioned,worker.function)
result <- Reduce('+',results)
Essentially, I am wondering if trying to read files in parallel would be done in an interleaved fashion instead. And if, as a result, this bottleneck would cut down on the expected performance of running tasks in parallel?
Would it be better if I first read all matrices at once in a list then sent chunks of this list to each core for it to be processed? what if these matrices were much larger, would I be able to load all of them in a list at once ?

Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function looks like:
worker.function <- function(file) {
matrix_list <- readRDS(file)
partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
Reduce('+',partial.results)
}
You should save some time on I/O and maybe even on computation by replacing a for with a lapply.

r double loop too slow with large data

I need to read hundred of .bil files:(reproductive example)
d19810101 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(1,10),value=c(11:20))
d19810102 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(2,10),value=c(12:21))
d19820101 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(1,10),value=c(13:22))
d19820102 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(2,10),value=c(14:23))
The code I wrote for testing small amount files works ok but when I tried to run the entire files, it went super slow, please let me know if there is any way that I can improve. What I need to do is simply get the average of 33 years of daily data, here is the code for testing small amount of files:
years <- c(1981:1982)
days <- substr(as.numeric(format(seq(as.Date("1981/1/1"), as.Date("1981/1/2"), "day"), '%Y%m%d')),5,8)
X_Y <- NULL
for (j in days) {
for (i in years) {
XYi <- read.table(paste(i,substr(j,1,2),substr(j,3,4),".csv",sep=''),header=T,sep=",",stringsAsFactors=F)
X_Y <- rbind(X_Y, XYi)
cat(paste("Data in ", i, j, " are processing now.", sep=""), "\n")
}
library(plyr)
X_Y1 <- ddply(X_Y, .(ID, month, day), summarize, mean(value, na.rm=T))
cat(paste("Data in ", i, j, " are processing now.", sep=""), "\n")
}
EDIT:
Thank you for all your help! I tried putting the files in a list to read, but since its .bil files which needs to get the raster characteristics, thus I got error, that's why I need to read them one by one, sorry for didn't make it clear earlier
Read.files <- function(file.names, sep=",") {
library(raster)
ldply(file.names, function(fn) data.frame(Filename=fn, layer <- raster(fn, sep=",")))
}
data1 <- Read.files(paste("filenames here",days,".bil",sep=''), sep=",")
"Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class 'structure("RasterLayer", package = "raster")' into a data.frame.
EDIT 2:
The data structure of my data is actually same with the example data, only that my data is grid data and needs to be extracted(using raster function instead of read.csv), and then to be put into data frame, therefore I need to do the following steps:
for (i in days)
{
layer <- raster(paste("filename here",i,".bil",sep=''))
projection <- projection(layer)
cellsize <- res(layer)[1]
...
s <- resample(layer,r, method='ngb')
XY <- data.frame(rasterToPoints(s))
names(XY) <- c('Long','Lat','Data')
}

It's hard to tell exactly how your are managing file IO, but I think an easier way to achieve this would be to read the files in, put them into one data.frame (e.g. using rbind()), and then get the summary statistics you need via tapply():
data <- do.call(rbind, mget(ls(pattern = "d[0-9]*"))) # combine data
with(data, tapply(value, list(month, day), mean)) # get mean for each month and day combination
This assumes you have already read in all of the files, to objects named as in your example.

Merging multiple rasters in R

I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I got so far (object 'ast14dmo' representing a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
if (i == 1) {
# Merge current with subsequent scene
ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
} else if (i > 1 && i < length(ast14dmo.sd)) {
tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
} else {
# Save merged image
writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""), format = "GTiff", overwrite = TRUE)
}
}
As you surely guess, the code works. However, merging takes quite long considering that each single raster object is some 70 mb large. I also tried Reduce and do.call, but that failed since I couldn't pass the argument 'tolerance' which circumvents the different origins of the raster files.
Anybody got an idea of how to speed things up?

You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)

The 'merge' function from the Raster package is a little slow. For large projects a faster option is to work with gdal commands in R.
library(gdalUtils)
library(rgdal)
Build list of all raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile=all_my_rasts,dst_dataset="MyBigNastyRasty.tif",of="GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.

You can use Reduce like this for example :
Reduce(function(...)merge(...,tolerance=1),ast14dmo.sd)

SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to SAGA .sgrd format first, then run the command line saga_cmd.

I have tested the solution using gdalUtils as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking with document of mosaic_raster function here, I found that it works without making a template raster before mosaic the images. I pasted the example codes from the document below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if(require(raster) && require(rgdal) && valid_install)
{
layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package="gdalUtils")
layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package="gdalUtils")
mosaic_rasters(gdalfile=c(layer1,layer2),dst_dataset=file.path(outdir,"test_mosaic.envi"),
separate=TRUE,of="ENVI",verbose=TRUE)
gdalinfo("test_mosaic.envi")
}

I was faced with this same problem and I used
#Read desired files into R
data_name1<-'file_name1.tif'
r1=raster(data_name1)
data_name2<-'file_name2.tif'
r2=raster(data_name2)
#Merge files
new_data <- raster::merge(r1, r2)
Although it did not produce a new merged raster file, it stored in the data environment and produced a merged map when plotted.

I ran into the following problem when trying to mosaic several rasters on top of each other
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As #Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
filename = file.path("Mosaic_raster.tif"),
format = "GTiff",
overwrite = TRUE)