Rasterize polygons in R using snowfall & sfLapply

I would like to rasterize a very large vector file to 25m and have had some success with the 'cluster' package, adapting the questions here and here, which worked nicely for that particular data.
However, I now have a larger vector file that needs rasterizing and have access to a cluster that uses snowfall. I'm not used to cluster functions and I'm just not sure how to set up sfLapply. I consistently get the following sort of error when sfLapply is called on the cluster:
Error in checkForRemoteErrors(val) :
one node produced an error: 'quote(96)' is not a function, character or symbol
Calls: sfLapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors
my full code:
library(snowfall)
library(rgeos)
library(maptools)
library(raster)
library(sp)
setwd("/home/dir/")
# Initialise the cluster...
hosts = as.character(read.table(Sys.getenv('PBS_NODEFILE'),header=FALSE)[,1]) # read the nodes to use
sfSetMaxCPUs(length(hosts)) # make sure the maximum allowed number of CPUs matches the number of hosts
sfInit(parallel=TRUE, type="SOCK", socketHosts=hosts, cpus=length(hosts), useRscript=TRUE) # initialise a socket cluster session with the named nodes
sfLibrary(snowfall)
# read in required data
shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
crs(shp) <- BNG
### rasterize the uniques to 25m and write (GB and clipped) ###
rw <- raster(res=c(25,25), xmn=0, xmx=600000, ymn=0, ymx=1000000, crs=BNG)
# Number of polygons features in SPDF
features <- 1:nrow(shp[,])
# Split features in n parts
n <- 96
parts <- split(features, cut(features, n))
rasFunction = function(X, shape, raster, nparts){
  ras = rasterize(shape[nparts[[X]],], raster, 'CODE')
  return(ras)
}
# Export everything in the workspace onto the cluster...
sfExportAll()
# Distribute calculation across the cluster nodes...
rDis = sfLapply(n, fun=rasFunction,X=n, shape=shp, raster=rw, nparts=parts) # equivalent of sapply
rMerge <- do.call(merge, rDis)
writeRaster(rMerge, filename="my_data_25m", format="GTiff", overwrite=TRUE)
# Stop the cluster...
sfStop()
I've tried a number of things, changing the function and sfLapply, but I just can't get this to run. Thanks.
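For reference: like lapply, sfLapply takes the vector to iterate over as its first argument and hands each element to the first formal of fun. In the call above, the scalar n is supplied as the data while X=n is passed again as a named argument, so a number ends up where a function/element is expected, which appears to be what produces the 'quote(96)' error. A minimal sketch of a corrected call (untested on the original data):
sfLibrary(raster) # the worker processes also need raster loaded, not just snowfall
sfExportAll()
rDis <- sfLapply(1:n, fun=rasFunction, shape=shp, raster=rw, nparts=parts) # iterate over the part indices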

Because I can't do formatting in comments:
library(maptools)
library(rgdal) # spTransform() methods for Spatial* objects come from rgdal
shp <- readShapePoly("my_data.shp") # note: spTransform needs the source CRS to be set on shp
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
shp.2 <- spTransform(shp, CRS(BNG))
#Continue as before
Overwriting a projection != reprojecting data.

OK, so I abandoned snowfall and looked into gdalUtils::gdal_rasterize instead, and found a lot of benefits to using it (with one downside that someone might be able to answer?).
Context & Issue: My vector data exist inside an ESRI File Geodatabase and require some processing pre-rasterization. No problem, rgdal::readOGR is fine. However, as gdal_rasterize requires a pathname to the vector data, I had trouble here because I could not write out my processed vector data: it exceeds the maximum file size for a shapefile outside of a geodatabase, and gdal_rasterize will not accept R objects, paths to .gdb geodatabases, or .Rdata/.rds files. How do I pass an object to gdal_rasterize?
So I wrote out the large shapefile in segments equal to the number of processors.
Originally raster::rasterize was used, as I could simply pass the vector object stored in memory to rasterize without the writing problem (though I would have liked to have it written), rasterizing this data to 25m. This took a pretty long time, even in parallel.
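An aside I have not fully verified: a format without the shapefile size limit, such as GeoPackage, can be written from R and read by gdal_rasterize via a plain path, which might remove the need to split at all (assuming your GDAL build includes the GPKG driver; file and layer names below are illustrative):
# untested sketch: write the processed vectors to a GeoPackage (no shapefile size limit)
writeOGR(shape, dsn="mydata.gpkg", layer="mydata", driver="GPKG")
# ...then point gdal_rasterize at the GeoPackage directly (l = layer name;
# gdal_full.tif is a pre-made blank raster as in the workflow below)
gdal_rasterize(src_datasource="mydata.gpkg", l="mydata",
               dst_filename="gdal_full.tif", b=1, a="code", verbose=F, output_Raster=T)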
Solution: gdal_rasterize in parallel.
# gdal_rasterize in parallel
require(gdalUtils)
require(rgdal)
require(rgeos)
require(cluster)
require(parallel)
require(raster)
# read in vector data
shape <- readOGR("./mygdb.gdb", layer="mydata",stringsAsFactors=F)
## do all the vector processing etc ##
# split vector data into n parts, the same as number of processors (minus 1)
npar <- detectCores() - 1
features <- 1:nrow(shape[,])
parts <- split(features, cut(features, npar))
# write the vector parts out
for(n in 1:npar){
  writeOGR(shape[parts[[n]],], ".\\parts", paste0("mydata_p", n), driver="ESRI Shapefile")
}
# set up and write a blank raster for gdal_rasterize for EACH vector segment created above
r <- raster(res=c(25,25), xmn=234000, xmx=261000, ymn=229000, ymx=256000, crs=projection(shape))
for(n in 1:npar){
  writeRaster(r, filename=paste0(".\\gdal_p", n, ".tif"), format="GTiff", overwrite=TRUE)
}
# set up cluster and pass required packages and objects to cluster
cl <- makeCluster(npar)
clusterEvalQ(cl, sapply(c('raster', 'gdalUtils',"rgdal"), require, char=TRUE))
clusterExport(cl, list("r","npar"))
# parallel apply the gdal_rasterize function against the vector parts that were written,
# same number as processors, against the pre-prepared rasters
parLapply(cl = cl, X = 1:npar, fun = function(x)
  gdal_rasterize(src_datasource = paste0(".\\parts\\mydata_p", x, ".shp"),
                 dst_filename = paste0(".\\gdal_p", x, ".tif"), # x, not the leftover loop variable n
                 b = 1, a = "code", verbose = F, output_Raster = T))
# There are now n rasters representing the n segments of the original vector file
# read in the rasters as a list, merge and write to a new tif.
s <- lapply(X = 1:npar, function(x) raster(paste0(".\\gdal_p", x, ".tif"))) # x, not n
s$filename <- "myras_final.tif"
do.call(merge,s)
stopCluster(cl)
The entire job (roughly 60% of the time spent in vector reading/processing/writing and 40% in raster generation and rasterization) ran about 9 times faster with this code than with raster::rasterize in parallel.
Note: I tried this initially by splitting the vectors into n parts but creating only one blank raster. I then wrote to that same blank raster from all cluster nodes simultaneously, but this corrupted the raster and made it unusable in R/Arc/anything (despite the function completing without error). The approach above is more stable, but n blank rasters have to be made instead of one, increasing processing time, and merging n rasters is extra processing.
Caveat: the parallel raster::rasterize run did not have writeRaster inside the rasterize call but as a separate line, which will have increased its processing time due to storage to temp files etc.
EDIT: Why are the frequency tables from the raster produced by gdal_rasterize not the same as those from raster::rasterize? I mean, with 100 million cells I expect a bit of difference, but for some codes it was a few thousand cells different. I thought they both rasterized by centroid?
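For what it's worth, both tools should burn a cell based on whether its centre falls inside the polygon by default (GDAL's ALL_TOUCHED is off unless at=TRUE is passed to gdal_rasterize), so residual differences presumably come from edge cases such as boundaries passing exactly through cell centres, or differing treatment of overlapping polygons. Assuming both outputs share the same grid, a quick way to localize the disagreement (file names illustrative):
r_gdal <- raster("gdal_output.tif")    # result from gdal_rasterize
r_rast <- raster("raster_output.tif")  # result from raster::rasterize
freq(r_gdal != r_rast)                 # how many cells actually differ
plot(r_gdal != r_rast)                 # where they differ (likely along polygon edges)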

Related

How to streamline and speed up loop with getData package in R

I am trying to download high-resolution climate data for a bunch of lat/long coordinates, and combine them into a single dataframe. I've come up with a solution (below), but it will take forever with the large list of coordinates I have. I asked a related question on the GIS StackExchange to see if anyone knew of a better approach for downloading and merging the data, but I'm wondering if I could somehow just speed up the operation of the loop? Does anyone have any suggestions on how I might do that? Here is a reproducible example:
# Download and merge 0.5 minute MAT/MAP data from WorldClim for a list of lon/lat coordinates
# This is based on https://emilypiche.github.io/BIO381/raster.html
# Make a dataframe with coordinates
coords <- data.frame(Lon = c(-83.63, 149.12), Lat=c(10.39,-35.31))
# Load package
library(raster)
# Make an empty dataframe for dumping data into
coords3 <- data.frame(Lon=integer(), Lat=integer(), MAT_10=integer(), MAP_mm=integer()) # names must match below for rbind
# Get WorldClim data for all the coordinates, and dump into coords 3
for(i in seq_along(coords$Lon)) {
  r <- getData("worldclim", var="bio", res=0.5, lon=coords[i,1], lat=coords[i,2]) # Download the tile containing the lat/lon
  r <- r[[c(1,12)]] # Reduce the layers in the RasterStack to just the variables we want (MAT*10 and MAP_mm)
  names(r) <- c("MAT_10", "MAP_mm") # Rename the columns to something intelligible
  points <- SpatialPoints(na.omit(coords[i,1:2]), proj4string = r@crs) # give lon/lat to SpatialPoints
  values <- extract(r, points)
  coords2 <- cbind.data.frame(coords[i,1:2], values)
  coords3 <- rbind(coords3, coords2)
}
# Convert MAT*10 from WorldClim into MAT in Celsius
coords3$MAT_C <- coords3$MAT_10/10
Edit: Thanks to advice from Dave2e, I now first make a list, put the intermediate results in the list, and bind the rows together at the end. I haven't timed this yet to see how much faster it is than my original solution. If anyone has further suggestions on how to improve the speed, I'm all ears! Here is the new version:
coordsList <- list()
for(i in seq_along(coordinates$lon_stm)) {
  r <- getData("worldclim", var="bio", res=0.5, lon=coordinates[i,7], lat=coordinates[i,6]) # Download the tile containing the lat/lon
  r <- r[[c(1,12)]] # Reduce the layers in the RasterStack to just the variables we want (MAT*10 and MAP_mm)
  names(r) <- c("MAT_10", "MAP_mm") # Rename the columns to something intelligible
  points <- SpatialPoints(na.omit(coordinates[i,7:6]), proj4string = r@crs) # give lon/lat to SpatialPoints
  values <- extract(r, points)
  coordsList[[i]] <- cbind.data.frame(coordinates[i,7:6], values)
}
coords_new <- dplyr::bind_rows(coordsList) # bind_rows() comes from dplyr
Edit2: I used system.time() to time the execution of both of the above approaches. When I did the timing, I had already downloaded all of the data, so the download time isn't included in my time estimates. My first approach took 45.01 minutes, and the revised approach took 44.15 minutes, so I'm not really seeing a substantial time savings by doing it the latter way. Still open to advice on how to revise the code so I can improve the speed of the operations!
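A further thought on structure, since extract() is vectorised over points: getData() resolves one WorldClim tile per call, so grouping the coordinates by tile and extracting all of a tile's points in one call should remove most of the per-row overhead. A rough sketch (the 30-degree tiling below is an assumption about WorldClim's tile layout; adjust if needed):
library(raster)
# group points by (assumed) 30-degree WorldClim tile
tile_id <- paste(floor(coords$Lon/30), floor(coords$Lat/30))
resList <- lapply(split(coords, tile_id), function(grp) {
  r <- getData("worldclim", var="bio", res=0.5, lon=grp$Lon[1], lat=grp$Lat[1]) # one download per tile
  r <- r[[c(1,12)]]
  names(r) <- c("MAT_10", "MAP_mm")
  pts <- SpatialPoints(grp[, c("Lon","Lat")], proj4string = r@crs)
  cbind(grp, extract(r, pts)) # extract all points in this tile at once
})
coords_fast <- do.call(rbind, resList)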

Why does cropping a raster stack change the names of layers?

I'm processing yearly multilayer netCDF files with daily precipitation data from CHIRPS. I have the files for the whole world, each file about 1.2 GB. I need to calculate indices from the precipitation data for each cell in the raster for a specific region. In order to do that, I'm trying to crop the files to a rectangular shape over my area of interest using the raster R package.
This is the code I'm using, shown for the first file.
library(ncdf4)
library(raster)
library(rgdal)
# Crop extent
crop_extent <- as(raster::extent(79, 89, 25, 31), "SpatialPolygons")
proj4string(crop_extent) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
# Set directory with original files
setwd("~/data")
# Read file
chirps81 <- stack("chirps-v2.0.1981.days_p05.nc")
chirps81crop <- crop(chirps81, crop_extent)
# Write cropped file back to different folder
setwd("~/croppeddata")
writeRaster(chirps81crop, "chirps81crop.nc", overwrite=TRUE)
For some reason, however, the layers lose their names when the file is written. In the original files and after cropping, the layers have names of the format "X1981.01.01", but after writing and re-reading the netCDF file with newfile <- stack("chirps81crop.nc"), the layer names change to the format "X1" up to "X365". I think it should be fine to work with, assuming the order of the layers didn't get mixed up, but I don't understand what is happening to the layer names, or whether it happens because something is wrong with the code.
It's the writeRaster() function that is losing the layer names, not the crop operation. It is possible to use lower-level ncdf4 functions to assign a numeric value (not a string, unfortunately) to each layer, which will then show up in the layer names after reading. Taking inspiration from the example here, I created some code that shows this.
library(ncdf4)
library(raster)
library(rgdal)
# Crop extent
crop_extent <- as(raster::extent(5.74, 5.75, 50.96, 50.97), "SpatialPolygons")
proj4string(crop_extent) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
# make a sample file
r <- raster(system.file("external/test.grd", package="raster"))
r.latlon <- projectRaster(r, crs = proj4string(crop_extent))
writeRaster(x=r.latlon, filename = 'test.nc', format = 'CDF', overwrite=TRUE)
# read the sample as a 2 layer stack and crop it
test <- stack('test.nc', 'test.nc')
writeRaster(test, 'teststack.nc', overwrite=TRUE, format='CDF')
testcrop <- crop(test, crop_extent)
names(testcrop)
# [1] "test.1" "test.2"
# write the cropped file and make the zname equal to Layer
writeRaster(testcrop, 'testcrop.nc', overwrite=TRUE, format='CDF', zname='Layer')
# open the cdf file directly
nc <- nc_open('testcrop.nc', write = T)
# give the layers numbers starting from 10 so
# we can see them easily
layers = 1:nlayers(testcrop) + 10
layers
# [1] 11 12
ncvar_put(nc, 'Layer', layers)
nc_close(nc)
newtestcrop <- stack('testcrop.nc')
names(newtestcrop)
# [1] "X11" "X12"
nc <- nc_open('testcrop.nc', write = F)
layers = ncvar_get(nc, 'Layer')
layers
# [1] 11 12
nc_close(nc)
So it is possible to get names with numbers under your control when writing the raster, but I don't know enough about your environment to determine if this will help since it might be tricky to map the names you need to a single unambiguous number.
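In the CHIRPS case specifically, since each file holds one calendar year of daily layers, the date-style names could also simply be rebuilt after reading, assuming writeRaster preserved the layer order:
newfile <- stack("chirps81crop.nc")
dates <- seq(as.Date("1981-01-01"), by="day", length.out=nlayers(newfile))
names(newfile) <- format(dates, "X%Y.%m.%d") # restores the X1981.01.01 style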
I hope you don't mind me offering a non-R solution, but this task is much easier from the command line using CDO:
cdo sellonlatbox,79,89,25,31 chirps-v2.0.1981.days_p05.nc cropped_file.nc
Which indices did you want to calculate? I suspect it is possible to calculate those quickly and easily with CDO functions too...

Extract raster by a list of SpatialPolygonsDataFrame objects in R

I am trying to extract summed raster cell values from a single big file for various SpatialPolygonsDataFrame (SPDF) objects stored in a list in R, then add the extracted values to the SPDF objects' attribute tables. I would like to iterate this process and have no idea how to do so. I have found an efficient solution for multiple polygons stored in a single SPDF object (see: https://gis.stackexchange.com/questions/130522/increasing-speed-of-crop-mask-extract-raster-by-many-polygons-in-r), but do not know how to apply the crop > mask > extract procedure to a LIST of SPDF objects, each containing multiple polygons. Here is a reproducible example:
library(maptools) ## For wrld_simpl
library(raster)
## Example SpatialPolygonsDataFrame
data(wrld_simpl) #polygon of world countries
bound1 <- wrld_simpl[1:25,] #country subset 1
bound2 <- wrld_simpl[26:36,] #subset 2
## Example RasterLayer
c <- raster(nrow=2e3, ncol=2e3, crs=proj4string(wrld_simpl), xmn=-180,
            xmx=180, ymn=-90, ymx=90)
c[] <- 1:length(c)
#plot, so you can see it
plot(c)
plot(bound, add=TRUE)
plot(bound2, add=TRUE, col=3)
#make list of two SPDF objects
boundl <- list()
boundl[[1]] <- bound1
boundl[[2]] <- bound2
#confirm creation of SPDF list
boundl
The following is what I would like to run for the entire list, in a for-loop format. For a single SPDF from the list, the following series of functions seems to work:
clip1 <- crop(c, extent(boundl[[1]])) #crops the raster to the extent of the polygon, I do this first because it speeds the mask up
clip2 <- mask(clip1, boundl[[1]]) #crops the raster to the polygon boundary
extract_clip <- extract(clip2, boundl[[1]], fun=sum)
#add column + extracted raster values to polygon dataframe
boundl[[1]]@data["newcolumn"] <- extract_clip
But when I try to isolate the first function for the SPDF list (raster::crop), it does not return a raster object:
crop1 <- crop(c, extent(boundl[[1]])) #correctly returns object class 'RasterLayer'
cropl <- lapply(boundl, crop, c, extent(boundl)) #incorrectly returns objects of class 'SpatialPolygonsDataFrame'
When I try to isolate the mask function for the SPDF list (raster::mask), it returns an error:
maskl <- lapply(boundl, mask, c)
#Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘mask’ for signature ‘"SpatialPolygonsDataFrame", "RasterLayer"’
I would like to correct these errors and efficiently iterate the entire procedure within a single loop (i.e., crop > mask > extract > add the extracted values to the SPDF attribute tables). I am really new to R and don't know where to go from here. Please help!
One approach is to take what is working and simply put the desired "crop -> mask -> extract -> add" into a for loop:
for(i in seq_along(boundl)) {
  clip1 <- crop(c, extent(boundl[[i]]))
  clip2 <- mask(clip1, boundl[[i]])
  extract_clip <- extract(clip2, boundl[[i]], fun=sum)
  boundl[[i]]@data["newcolumn"] <- extract_clip
}
One can speed up the loop with parallel execution, e.g., with the R package foreach. Conversely, the speed gain from using lapply() instead of the for loop would be small.
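A sketch of what that could look like with foreach and doParallel (the cluster size is illustrative):
library(foreach)
library(doParallel)
cl <- makeCluster(2) # illustrative; use detectCores() - 1 in practice
registerDoParallel(cl)
# each iteration returns the extracted sums for one SPDF
extractl <- foreach(i = seq_along(boundl), .packages = "raster") %dopar% {
  clip1 <- crop(c, extent(boundl[[i]]))
  clip2 <- mask(clip1, boundl[[i]])
  extract(clip2, boundl[[i]], fun = sum)
}
stopCluster(cl)
# attach the results back on the master process
for (i in seq_along(boundl)) boundl[[i]]@data["newcolumn"] <- extractl[[i]]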
Why the error occurs:
cropl <- lapply(boundl, crop, c, extent(boundl))
applies the function crop() to each element of the list boundl. The performed operation is
tmp <- crop(boundl[[1]], c)
## test if equal to first element
all.equal(cropl[[1]], tmp)
[1] TRUE
To get the desired result use
cropl <- lapply(boundl, function(x, c) crop(c, extent(x)), c=c)
## test if the first element is as expected
all.equal(cropl[[1]], crop(c, extent(boundl[[1]])))
[1] TRUE
Note:
Using c to denote an R object is a bad choice, because it is easily confused with c().

Rasterizing and mosaicking a large (~15GB) CSV in parallel

I'm trying to rasterize a large CSV file which contains X and Y coordinates of raster pixels and an ID representing different individuals that have moved through the pixel.
I want 2 rasters from this operation: In the first, each pixel of the raster contains the number of times individuals have crossed it and in the second, each pixel contains the number of unique individuals that have crossed it.
I currently create rasters for each ID and then mosaic them. I've parallelized this operation, using 16 cores on a CPU with 200 GB RAM. However, the CSV contains over a million rows, and it seems a large overhead to copy it to each core. It also takes upwards of 4 days to complete, and I have multiple such CSV files.
Is there a better, faster way to do this?
This is what I do currently:
n.cores <- 16 #this is half of the available cores
#Register CoreCluster
cl <- makeCluster(n.cores)
registerDoSNOW(cl)
clusterExport(cl=cl, "fil2" ) #fil2 is the CSV
#uq_id is a vector of all unique IDs
rasterList = foreach(i=1:length(uq_id), .combine='comb', .init=list()) %dopar% {
  library(raster)
  # create a raster for each unique ID
  dfXYID <- subset(fil2, uqid %in% i)
  dfXYID$uqid <- as.numeric(dfXYID$uqid)
  dfXYID <- as.data.frame(dfXYID)
  x <- raster(xmn=min(dfXYID$V1), xmx=max(dfXYID$V1), ymn=min(dfXYID$V2), ymx=max(dfXYID$V2),
              res=220, crs="+proj=utm +zone=43 +datum=WGS84 +units=m +no_defs")
  # raster of the number of times individual i moved through each pixel
  lenList <- rasterize(dfXYID[, c('V1', 'V2')], x, dfXYID[, 'uqid'],
                       fun=function(x, ...){ length(na.omit(x)) })
  return(lenList)
}
rasterList is then a list of rasters, each representing the movement of individual i. I then have to do another time-consuming operation (can I do this in parallel within the above foreach?): apply a mosaic operation on this large list to get the 2 raster outputs that I need.
I would be very grateful if anyone has any suggestions on how I can speed this up.
I understand gdal is an option, but I am constrained as I have to work with a CSV, and I don't want to write each raster in rasterList to disk for further processing with gdal.
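Given that constraint, one alternative worth sketching (untested at this scale, and assuming each CSV row represents one crossing): skip the per-ID rasterize() and the mosaic entirely, and aggregate the CSV by cell index instead. cellFromXY() maps every row to a pixel of a single template raster, and both outputs then reduce to group-by counts, e.g. with data.table:
library(raster)
library(data.table)
# one template raster spanning the full extent of the data (220 m, UTM 43N as above)
tmpl <- raster(xmn=min(fil2$V1), xmx=max(fil2$V1), ymn=min(fil2$V2), ymx=max(fil2$V2),
               res=220, crs="+proj=utm +zone=43 +datum=WGS84 +units=m +no_defs")
dt <- as.data.table(fil2)
dt[, cell := cellFromXY(tmpl, cbind(V1, V2))]  # pixel index for every row
crossings   <- dt[, .N, by=cell]               # rows (crossings) per pixel
individuals <- dt[, uniqueN(uqid), by=cell]    # unique individuals per pixel
r_cross <- tmpl; r_cross[] <- NA; r_cross[crossings$cell] <- crossings$N
r_indiv <- tmpl; r_indiv[] <- NA; r_indiv[individuals$cell] <- individuals$V1
This touches the CSV once, needs no cluster, and produces both rasters in one pass, with no mosaicking step.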

Merging multiple rasters in R

I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I got so far (the object 'ast14dmo.sd' represents a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
  if (i == 1) {
    # Merge current with subsequent scene
    ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
  } else if (i > 1 && i < length(ast14dmo.sd)) {
    tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
    ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
  } else {
    # Save merged image
    writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""), format = "GTiff", overwrite = TRUE)
  }
}
As you surely guess, the code works. However, merging takes quite a long time, considering that each single raster object is some 70 MB. I also tried Reduce and do.call, but those failed since I couldn't pass the argument 'tolerance', which works around the different origins of the raster files.
Anybody got an idea of how to speed things up?
You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)
The 'merge' function from the raster package is a little slow. For large projects a faster option is to work with GDAL commands in R.
library(gdalUtils)
library(rgdal)
Build a list of all the raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this as a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile=all_my_rasts,dst_dataset="MyBigNastyRasty.tif",of="GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.
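For that case, a sketch of the VRT route with gdalUtils (a VRT is a lightweight XML mosaic that gdal_translate can then materialise; output names illustrative):
library(gdalUtils)
gdalbuildvrt(gdalfile=all_my_rasts, output.vrt="mosaic.vrt") # cheap, no pixel data copied
gdal_translate(src_dataset="mosaic.vrt", dst_dataset="MyBigMosaic.tif", of="GTiff")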
You can use Reduce, for example:
Reduce(function(...) merge(..., tolerance=1), ast14dmo.sd)
The SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to SAGA's .sgrd format first, then run the command-line tool saga_cmd.
I have tested the solution using gdalUtils as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking the documentation of the mosaic_rasters function here, I found that it works without making a template raster before mosaicking the images. I pasted the example code from the documentation below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if(require(raster) && require(rgdal) && valid_install)
{
layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package="gdalUtils")
layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package="gdalUtils")
mosaic_rasters(gdalfile=c(layer1, layer2), dst_dataset=file.path(outdir, "test_mosaic.envi"),
               separate=TRUE, of="ENVI", verbose=TRUE)
gdalinfo("test_mosaic.envi")
}
I was faced with this same problem and used the following:
#Read desired files into R
data_name1<-'file_name1.tif'
r1=raster(data_name1)
data_name2<-'file_name2.tif'
r2=raster(data_name2)
#Merge files
new_data <- raster::merge(r1, r2)
Although it did not write a new merged raster file to disk, the merged raster was stored in the R environment and produced a merged map when plotted.
I ran into the following problem when trying to mosaic several rasters on top of each other:
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As @Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first.
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m, filename = "Mosaic_raster.tif", format = "GTiff", overwrite = TRUE)
