Merging multiple rasters in R - r

I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I got so far (object 'ast14dmo' representing a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
if (i == 1) {
# Merge current with subsequent scene
ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
} else if (i > 1 && i < length(ast14dmo.sd)) {
tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
} else {
# Save merged image
writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""), format = "GTiff", overwrite = TRUE)
}
}
As you surely guess, the code works. However, merging takes quite long considering that each single raster object is some 70 mb large. I also tried Reduce and do.call, but that failed since I couldn't pass the argument 'tolerance' which circumvents the different origins of the raster files.
Anybody got an idea of how to speed things up?

You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)

The 'merge' function from the Raster package is a little slow. For large projects a faster option is to work with gdal commands in R.
library(gdalUtils)
library(rgdal)
Build list of all raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile=all_my_rasts,dst_dataset="MyBigNastyRasty.tif",of="GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.

You can use Reduce like this for example :
Reduce(function(...)merge(...,tolerance=1),ast14dmo.sd)

SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to SAGA .sgrd format first, then run the command line saga_cmd.

I have tested the solution using gdalUtils as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking with document of mosaic_raster function here, I found that it works without making a template raster before mosaic the images. I pasted the example codes from the document below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if(require(raster) && require(rgdal) && valid_install)
{
layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package="gdalUtils")
layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package="gdalUtils")
mosaic_rasters(gdalfile=c(layer1,layer2),dst_dataset=file.path(outdir,"test_mosaic.envi"),
separate=TRUE,of="ENVI",verbose=TRUE)
gdalinfo("test_mosaic.envi")
}

I was faced with this same problem and I used
#Read desired files into R
data_name1<-'file_name1.tif'
r1=raster(data_name1)
data_name2<-'file_name2.tif'
r2=raster(data_name2)
#Merge files
new_data <- raster::merge(r1, r2)
Although it did not produce a new merged raster file, it stored in the data environment and produced a merged map when plotted.

I ran into the following problem when trying to mosaic several rasters on top of each other
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As #Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
filename = file.path("Mosaic_raster.tif"),
format = "GTiff",
overwrite = TRUE)

Related

Is there a faster way to mask multiple rasters using multiple polygons?

Basically I have 12 multispectral images, and I want to mask them using 2 polygons (small waterbodies). The 2 polygons are in one shapefile, but I can break them up if it would make the process easier. With the help of some nice users on here, I tested this all out using the 12 images on one polygon and it works just fine, but I'll eventually need to do this for multiple polygons so I want to adapt my code.
The loop to crop all rasters using a single polygon:
#The single polygon
mask <- st_read(here::here("data", "mask.shp") %>%
st_as_sf()
#Creates list of input files and their paths
crop_in <- list.files(here::here("data", "s2_rasters"), pattern="tif$", full.names=TRUE)
#Creates list of output files and their directory.
crop_out <- gsub(here::here("data", "s2_rasters"), here::here("data", "s2_cropped"), crop_in)
for (i in seq_along(crop_in)) {
b <- brick(crop_in[i])
crop(b, mask, filename = crop_out[i])
}
Like I said this works just fine, but I want to mask instead of crop. Additionally, I need to mask using multiple polygons.
My working loop to do the same thing but for multiple (2) polygons:
masks_2 <- st_read(here::here("data", "multiple_masks.shp")) %>%
st_as_sf()
for (i in seq_along(crop_in)) {
b <- brick(crop_in[i])
mask(b, masks_2, filename = crop_out[i], overwrite = TRUE)
}
This took around 2 hours (which makes me suspicious) and I think it lost the polygon id somewhere along the way. When I tried plotting the results the plot was empty. My final output should be 24 rasterstacks, 12 for each polygon. I will need to do further image analysis so I will need to keep the names. I hope this makes sense and thank you!
Here is a minimal, self-contained, reproducible example using terra because it is much faster than raster (make sure you are using the current version)
Raster dataset with 12 layers
library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
r <- rep(r, 12) * 1:12
names(r) <- paste0("band", 1:12)
Two "lakes"
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,12)]
Solution:
x <- mask(r, v)
And always try things for a single case before running the loop.
So if you have 12 files, you can do something like
inf <- list.files("data/s2_rasters", pattern="tif$", full.names=TRUE)
outf <- gsub(".tif$", "_masked.tif", inf)
for (for i in 1:length(inf)) {
r <- rast(inf[i])
m <- mask(r, v, filename=outf[i])
}
It might be a little faster to instead do this (only rasterize the polygons once)
msk <- rast(inf[1])
msk <- rasterize(v, msk)
for (for i in 1:length(inf)) {
r <- rast(inf[i])
m <- mask(r, msk, filename=outf[i])
}
Or make one object/file, if that is practical.
rr <- rast(inf)
mm <- mask(rr, v)

How to preserve raster dataType in raster processing?

When doing raster math, for example raster1-raster2, the datatype of the output raster is 'FLT4S', even if the datatype ot both raster1 and raster 2 is 'INT2S'. How can I force the output to be 'INT2S', without writing to disk? Is there a global way of doing it saying that all raster processing shall result in INT2S data?
The reason for wanting 'INT2S' instead of 'FLT4S' is to save memory space and speed up processing when using for loops on larger raster datasets.
In rasterOptions() one can specify dataType, but as far as I understand that only applies when writing to disk, right?
#load package raster
require (raster)
#create sample rasters
r1<-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=1:100)
r2<-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=100:1)
#set dataType of sample rasters to 'INT2S'
dataType(r1)<-'INT2S'
dataType(r2)<-'INT2S'
#check dataType of sample rasters
dataType(r1)
dataType(r2)
#do some simple arithmetics
r3<-r2-r1
#check the dataType of the output raster
dataType(r3)
I would like dataType(r3) to be 'INT2S' as well
In your example, it actually does work. Notice the L after the numbers if the vals argument. That is equivalent to as.integer
library(raster)
r1 <-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=1:100L)
r2 <-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=100:1L)
r3 <- r2 - r1
class(values(r3))
#[1] "integer"
But in other cases it does not work
class(values(r3 - 2L))
[1] "numeric"
And you cannot control that behavior.
Note that dataType provides information about the file that the RasterLayer refers to. If there is no file, like in the above example, the value is meaningless. You should also not set it, except for debugging when dealing with an existing file.
So the best you can do is set the datatype when writing to disk as in
writeRaster(r3, filename="test.tif", datatype="INT2S")
I have the same problem with 'INT2U' data and I don't believe it's possible. AFAIK, R supports 'numeric'. Floating point can be coerced to 'integer' with 'as.integer()' but I think it's just truncated.
I've had this same question too. Hopefully someone will chime in if I'm wrong, but I believe you really only have one option and that is to convert the output when you write to disk:
writeRaster(r3,filename="r3.tif", format="GTiff", datatype='INT2S')
then load back into R.
You can convert the output after the calculation, but it does not change the size of the object:
dataType(r3)<-'INT2S'
You can check the object size using:
object.size(r3)
If you write the data to disk and reload the object size will be smaller.
Another option is to use the calc or overlay functions in raster to do some math, save to disk and return the resulting raster in one go. I wrote the example rasters to disk to start since Robert said you should not assign it and in real cases you will probably be reading from disk.
# load package raster
library(raster)
# create sample rasters on disk
r1 <- raster::raster(ext = extent(c(0, 10, 0, 10)), res = 1, vals = 1:100)
r2 <- raster::raster(ext = extent(c(0, 10, 0, 10)), res = 1, vals = 100:1)
r1_fl <- tempfile(fileext = ".tif")
r2_fl <- tempfile(fileext = ".tif")
r3_fl <- tempfile(fileext = ".tif")
writeRaster(r1, r1_fl, datatype = "INT2S")
writeRaster(r2, r2_fl, datatype = "INT2S")
# read rasters from disk
r1 <- raster(r1_fl)
r2 <- raster(r2_fl)
# check dataType of sample rasters
dataType(r1)
dataType(r2)
# do some simple (or complex) arithmetic
r3 <- overlay(r1, r2, fun = function(x, y){x - y}, filename = r3_fl,
datatype = "INT2S")
# check the dataType of the output raster
dataType(r3)

How to efficiently parallelize EXTRACT function in raster package R

Given a netcdf file, I am trying to extract all pixels to form a data.frame for later export to .csv
a=brick(mew.nc)
#get coordinates
coord<-xyFromCell(a,1:ncell(a))
I can extract data for all pixels using extract(a,1:ncell(a)). However, I run into memory issues.
Upon reading through various help pages, I found that one can speed up things with:
beginCluster(n=30)
b=extract(a, coord)
endCluster()
But I still run out of memory. Our supercomputer has more than 1000 nodes, each node has 32 cores.
My actual rasterbrick has 400,000 layers
I am not sure how to parrallize this task without running into memory issues.
Thank you for all your suggestions.
Sample data of ~8MB can be found here
You can do something along these lines to avoid memory problems
library(raster)
b <- brick(system.file("external/rlogo.grd", package="raster"))
outfile <- 'out.csv'
if (file.exists(outfile)) file.remove(outfile)
tr <- blockSize(b)
b <- readStart(b)
for (i in 1:tr$n) {
v <- getValues(b, row=tr$row[i], nrows=tr$nrows[i])
write.table(v, outfile, sep = ",", row.names = FALSE, append = TRUE, col.names=!file.exists(outfile))
}
b <- readStop(b)
To parallelize, you could do this by layer, or groups of layers; and probably all values in one step for each subset of layers. Here for one layer at a time:
f <- function(d) {
filename <- extension(paste(names(d), collapse='-'), '.csv')
x <- values(d)
x <- matrix(x) # these two lines only needed when using
colnames(x) <- names(d) # a single layer
write.csv(x, filename, row.names=FALSE)
}
# parallelize this:
for (i in 1:nlayers(b)) {
f(b[[i]])
}
or
x <- sapply(1:nlayers(b), function(i) f(b[[i]]))
You should not be using extract. The question I have is what you would want such a large csv file for.

Parallelize st_union from R's sf package

I have some large shapefiles with multiple millions of polygons that I need to dissolve. Depending upon the shapefile I need to either dissolve by group or just use st_union for all. I have been using the st_par function and it has been working great for most sf applications. Though when I use this function on st_union it returns a list and I cannot figure out how to parallize the sf dissolve function st_union.
Any suggestions would be most helpful! Here is a small code snippet to illustrate my point.
library(sf)
library(assertthat)
library(parallel)
us_shp <- "data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp"
if (!file.exists(us_shp)) {
loc <- "https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_20m.zip"
dest <- paste0("data/cb_2016_us_state_20m", ".zip")
download.file(loc, dest)
unzip(dest, exdir = "data/cb_2016_us_state_20m")
unlink(dest)
assert_that(file.exists(us_shp))
}
usa <- st_read("data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp", quiet= TRUE) %>%
filter(!(STUSPS %in% c("AK", "HI", "PR")))
test <- usa %>%
st_par(., st_union, n_cores = 2)
I think you can solve your specific problem with a small modification of the original st_par function.
However this is just a quick and bold fix and this might broke the code for other uses of the function.
The author of the function could certainly provide a better fix...
library(parallel)
# Paralise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...){
# Create a vector to split the data set up by.
split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))
# Perform GIS analysis
split_results <- split(sf_df, split_vector) %>%
mclapply(function(x) sf_func(x), mc.cores = n_cores)
# Combine results back together. Method of combining depends on the output from the function.
if ( length(class(split_results[[1]]))>1 | class(split_results[[1]])[1] == 'list' ){
result <- do.call("c", split_results)
names(result) <- NULL
} else {
result <- do.call("rbind", split_results)
}
# Return result
return(result)
}
I was trying to use this for st_join and was running into problems with the returned data type. In looking at the result more closely it became evident that the split_results was just a list of sf objects. I ended up modifying the code to use dplyr::bind_rows() to get what I wanted.
There probably needs to be some more logic around the "combine" to deal with different return types but this works for the st_join function.
# Parallelise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...) {
# Create a vector to split the data set up by.
split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))
# Perform GIS analysis
split_results <- split(sf_df, split_vector) %>%
mclapply(function(x) sf_func(x, ...), mc.cores = n_cores)
# Combine results back together. Method of combining probably depends on the
# output from the function. For st_join it is a list of sf objects. This
# satisfies my needs for reverse geocoding
result <- dplyr::bind_rows(split_results)
# Return result
return(result)
}

Calculating percentile across netcdf files fails

I have five netcdf files where each file contains data for a time section. I want to calculate the 98th percentile for the whole timespan for each cell individually.
The accumulated file size for the netcdf files is around 250 MB.
My approach it this:
library(raster)
fileType="\\.nc$"
filenameList <- list.files(path=getwd(), pattern=fileType, full.names=F, recursive=FALSE)
#rasterStack for all layers
rasterStack <- stack()
#stack all data
for(i in 1:length(filenameList)){
filename <- filenameList[i]
stack.temp<-stack(filename)
rasterStack<-stack(rasterStack, stack.temp)
}
#calculate raster containing the 98th percentiles
result <- calc(rasterStack, fun = function(x) {quantile(x,probs = .98,na.rm=TRUE)} )
However, I get this error:
Error in ncdf4::nc_close(x#file#con) :
no slot of name "con" for this object of class ".RasterFile"
The stacking section of my code works, the crash happens during the calc function.
Do you have any idea where this might come from? Is it maybe an issue of where the data is stored (memory/disk)?
Strange, I generated some dummy data and it seems to work just fine, it does not seems to be your method. 250MB is not overly huge. I would clip a small piece of each raster and test if it works.
dat<-matrix(rnorm(16), 4, 4)
r1<-raster(dat)
r2<-r1*2
r3<-r2+1
r4<-r3+4
rStack <- stack(r1,r2,r3,r4)
result <- calc(rStack, fun = function(x) {quantile(x,probs = .98)} )
Perhaps this is related to the odd way you create a RasterStack. You should simply do:
filenames <- list.files(pattern="\\.nc$")
rasterStack <- stack(filenames)

Resources