Custom spatial processing function consumes a lot of memory as code runs in R

I have several rasters, 343 to be exact, from Cropscape. I need to get the locations (centroids) and area measurements of the pixels that represent potatoes and tomatoes, based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTIFFs on a Google Drive, and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will:
1. Subset the rasters to potatoes and tomatoes
2. Convert the raster subsets to polygons, one for potatoes and one for tomatoes
3. Create centroids from each polygon
4. Create a SpatialPointsDataFrame based on the centroids
5. Extract the area measurement for each area of interest with the SpatialPointsDataFrame
6. Write the raster subsets and each polygon to file
Code:
library(raster)
library(rgdal)
library(rgeos)
dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]
# Toy raster list for use with the code below.
# In my real workflow I use `list.files()` on the directories that hold
# the rasters and read the files in with raster() from the list that is
# generated. That list is called "tiflist". It is not used in this code,
# but it is mentioned later.
rmk1 <- function(x, ...) {
  r1 = raster(ncol = 1000, nrow = 1000)
  r1[] = sample(1:60, 1000000, replace = T)
  proj4string(r1) = CRS(ae_pr)
  return(r1)
}
rlis <- lapply(1:5, rmk1)
#Pixel values needed
ptto <- c(43, 54)
# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
  # The next line is not necessary with the raster list created above.
  # temras = raster(x)
  now = format(Sys.time(), "%b%d%H%M%S")
  nwnm = paste0(names(x), now)
  rasmatx = match(x = x, table = ptto)
  writeRaster(rasmatx, file.path(dat_dir2, paste0(nwnm, "ras")), format = "GTiff")
  tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4 }, dissolve = T)
  tempol2 = disaggregate(tempol)
  # for potatoes
  tempol2p = tempol2[tempol2$layer == '1', ]
  if (nrow(tempol2p) > 0) {
    temcenp = gCentroid(tempol2p, byid = T)
    temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp), temcenp))
    temcenpdf$pot_p = extract(rasmatx, temcenpdf)
    temcenpdf$areap_m = gArea(tempol2p, byid = T)
    # writeOGR(temcenpdf, dsn = file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
  }
  # for tomatoes
  tempol2t = tempol2[tempol2$layer == '2', ]
  if (nrow(tempol2t) > 0) {
    temcent = gCentroid(tempol2t, byid = T)
    temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent), temcent))
    temcentdf$tom_t = extract(rasmatx, temcentdf)
    temcentdf$areat_m = gArea(tempol2t, byid = T)
    writeOGR(temcentdf, dsn = file.path(dat_dir2), paste0(nwnm, "t"), driver = "ESRI Shapefile")
  }
}
lapply(rlis, pottom)
I know I should provide some toy data, and I created some, but I don't know whether they exactly recreate my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem: a lot of memory is used when this code runs. With the real tiflist I can only get through the first 4 files before the RAM on my laptop (16 GB) is completely consumed. I'm pretty sure the connections to the Google Drive are involved, since the cache for the drive stream is at least 8 GB. I guess each raster is staying open after being read from the Google Drive? I don't know how to confirm that.
I think I need the function to clear out all of the objects it creates, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that there was no output at all from the function after 10 minutes, and by that point I usually have the first 3 rasters processed.
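For what it's worth, here is a minimal sketch of what that cleanup could look like at the end of pottom(), assuming the raster package's temporary files are part of the problem; objects local to the function are released when it returns anyway, so the bigger gain usually comes from clearing raster's temp files (showTmpFiles() reports what has been written to the temp directory):
# at the end of pottom(), before the closing brace
showTmpFiles()                 # list the raster package's temporary files
rm(rasmatx, tempol, tempol2)   # drop the large intermediate objects
removeTmpFiles(h = 0)          # delete raster temp files older than 0 hours
gc()                           # trigger garbage collection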
27/Oct EDIT after comments from RobertHijmans: it seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the version that works for me locally so that it works with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.

I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow, an issue I was unaware of. In one of my attempts I waited 30 minutes before killing the process with no result. It works under the conditions I require for rasters such as Alabama and Arkansas, for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test has been running for 12 minutes, so I don't know yet whether it works for an object as large as California. I don't want to copy and paste someone else's code into an answer to my own question.
One of the comments suggested using profvis, but I couldn't really make sense of its output, and it hung along with the process too.
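As an aside, when only centroids and areas are needed, one way to sidestep rasterToPolygons() entirely is to stay at the cell level. The following is only a sketch, shown against the toy rasters created above, using clump() (which requires the igraph package); its centroids are cell-centre averages, which approximate but do not exactly match the polygon centroids from gCentroid():
library(raster)
r   <- rlis[[1]]
pot <- r == 43                          # potato mask
pot[pot == 0] <- NA
cl  <- clump(pot)                       # label groups of adjacent potato cells
cx  <- zonal(init(cl, "x"), cl, "mean") # mean x per clump ~ centroid x
cy  <- zonal(init(cl, "y"), cl, "mean") # mean y per clump ~ centroid y
cnt <- freq(cl, useNA = "no")           # cell count per clump
pts <- SpatialPointsDataFrame(
  cbind(cx[, 2], cy[, 2]),
  data.frame(ID = cx[, 1], area_m = cnt[, "count"] * prod(res(cl))),
  proj4string = crs(cl)
)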

Related

How to write a loop for creating cropped raster for every id of a shapefile with a raster base?

I'm still new to R and don't know how to write a loop for my work process to make it more efficient.
I have a Digital Elevation Model (raster Barrow_5m.tif) and shapefiles for lakes and buffers, each with 10 IDs in the rows of its attribute table.
In the script below I created a new raster file for all features of the lake and the buffer shapefile with the data from the DEM raster. This works fine.
setwd("...")
Barrow_5m <- raster("Barrow_5m.tif")
Barrow_DTLB <- st_read("Barrow_DTLB.shp")
Barrow_DTLB_Buffer <- st_read("Barrow_DTLB_BufferOUT.shp")
Barrow_lake <- crop(Barrow_5m, extent(Barrow_DTLB))
raster_lake <- rasterize(Barrow_DTLB, Barrow_lake, mask = TRUE)
Barrow_buffer <- crop(Barrow_2m, extent(Barrow_DTLB_Buffer))
raster_buffer <- rasterize(Barrow_DTLB_Buffer, Barrow_buffer, mask = TRUE)
writeRaster(raster_lake, "raster_lake.tif")
writeRaster(raster_buffer, "raster_buffer.tif")
But now I want a raster file for every ID of the lake and the buffer shapefile separately, so 2 x 10 files.
I thought it would be best to write a loop for this, but my skills aren't good enough yet to do it.
Other questions haven't provided a solution so far, though I tried to help myself with them.
Alternatively, I could take the final TIFF from the script above and split it into files for every ID.
I want to write the loop rather than do it by hand for all the IDs of the shapefiles, because afterwards I am going to do the same with an even bigger shapefile with more values.
I found a solution now, by extracting the data by ID.
It creates a large list with 11 elements containing all the values for each ID, which is sufficient for my further work. You can also directly compute the mean, max, min, etc. of each element (so of each ID).
k <- Barrow_DTLB$ID # k = the IDs, one per row of the attribute table
LakesA <- extract(raster_lakeA, Barrow_DTLB[k, ])
LakesA_mean <- extract(raster_lakeA, Barrow_DTLB[k, ], fun = mean)
Maybe this solution is also helpful for a few, who already viewed the question.
I think this should work:
for (i in unique(raster_lake)) {
  r <- raster_lake
  r[!(values(r) == i)] <- NA
  r <- trim(r)
  writeRaster(r, paste0("raster_lake_", i, ".tif"))
}
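For the original goal of one clipped raster per ID, a loop over the shapefile features themselves is another option. This is only a sketch, assuming the objects loaded above (Barrow_5m, Barrow_DTLB) and an ID column as in the question; the same pattern would apply to the buffer shapefile:
library(raster)
library(sf)
for (id in Barrow_DTLB$ID) {
  shp  <- as(Barrow_DTLB[Barrow_DTLB$ID == id, ], "Spatial")  # one feature, as an sp object
  r_id <- mask(crop(Barrow_5m, extent(shp)), shp)             # clip the DEM to that feature
  writeRaster(r_id, paste0("raster_lake_ID_", id, ".tif"), overwrite = TRUE)
}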

Efficient use of raster functions in R

I have 500+ points in a SpatialPointsDataFrame object; I have a 1.7GB (200,000 rows x 200,000 cols) raster object. I want to have a tabulation of the values of the raster cells within a buffer around each of the 500+ points.
I have managed to achieve that with the code below (I got a lot of inspiration from here). However, it is slow to run and I would like to make it faster. It actually runs OK for buffers with "small" widths, say 5 km or even 15 km (~1 million cells), but it becomes super slow when the buffer increases to, say, 100 km (~42 million cells).
I could easily improve the loop below by using something from the apply family and/or a parallel loop. But my suspicion is that it is slow because the raster package writes 400 MB+ temporary files for each iteration of the loop.
# packages
library(rgeos)
library(raster)
library(rgdal)
myPoints = readOGR(points_path, 'myLayer')
myRaster = raster(raster_path)
myFunction = function(polygon_obj, raster_obj) {
  # this function returns a tabulation of the values of raster cells
  # inside a polygon (buffer)
  # crop to the extent of the polygon
  clip1 = crop(raster_obj, extent(polygon_obj))
  # crop to the polygon edge & convert to raster
  clip2 = rasterize(polygon_obj, clip1, mask = TRUE)
  # much faster than extract
  ext = getValues(clip2)
  # tabulate the values of the raster in the polygon
  tab = table(ext)
  return(tab)
}
# loop over the points
ids = unique(myPoints$ID)
for (id in ids) {
  # select point
  myPoint = myPoints[myPoints$ID == id, ]
  # create buffer
  myPolygon = gBuffer(spgeom = myPoint, byid = FALSE, width = myWidth)
  # extract the data I want (projections, etc. are fine)
  tab = myFunction(myPolygon, myRaster)
  # do stuff with tab ...
}
My questions:
Am I right to partially blame the writing operations? If I managed to avoid all those writing operations, would this code run faster? I have access to a machine with 32 GB of RAM, so I guess it is safe to assume I could load the raster into memory and would not need to write temporary files?
What else could I do to improve efficiency in this code?
I think you should approach it like this
library(raster)
library(rgdal)
myPoints <- readOGR(points_path, 'myLayer')
myRaster <- raster(raster_path)
e <- extract(myRaster, myPoints, buffer=myWidth)
And then something like
etab <- sapply(e, table)
It is hard to answer your question #1, as we do not know enough about your data (we do not know how many cells are covered by a "100 km" buffer). But you can control when raster writes to file with the rasterOptions function. You notice that getValues is faster than extract, based on the post you link to, but I think that is wrong, or at least not very important. The combination of crop, rasterize and getValues should have similar performance to extract (which does almost exactly that under the hood). If you go that route anyway, you should pass an empty RasterLayer, created with raster(myRaster), for faster cropping.
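A minimal sketch of those two suggestions; the option values are purely illustrative (check ?rasterOptions for the units your version of the package uses):
library(raster)
# raise the threshold at which raster starts writing temporary files
rasterOptions(maxmemory = 1e+09, chunksize = 1e+08)
rasterOptions()                # print the current settings
# an empty RasterLayer with the geometry of myRaster: cropping it is fast
# because no cell values are read
template <- raster(myRaster)
clip0 <- crop(template, extent(myPolygon))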

Loop polygon clipping with gIntersection

Question
I am trying to perform a number of polygon clips using the gIntersection function with R in a loop. I can obtain the correct clips and re-enter data manually (so I can turn the resulting SpatialPolygons object back into a SpatialPolygonsDataFrame object). What I can't do is get this working in a loop with for() or apply().
At the moment this isn't a problem. I have 9 English regions (with London), so it's not a huge challenge to set each clip up manually. But, I want to eventually clip LSOAs in LADs, which essentially means setting up >400 clips.
So, my question is, how do I turn my manual clips into a working loop?
Minimal Reproducible Example
To keep things simple, let's use the English regions (n = 9). For each of the 9 regions, I'm going to clip the counties. The following code loads the appropriate shapefiles and reprojects them as British National Grid:
require(rgdal)
require(rgeos)
# English counties shapefile (~10 MB zipped)
download.file(
  "https://dl.dropboxusercontent.com/s/6o0mi28pjo1kh9k/england-counties.zip",
  "ec", method = "wget")
unzip("ec")
ec <- readOGR("england-counties", "england_ct_2011")
proj4string(ec) <- CRS("+init=epsg:27700")
# English regions (~6 MB zipped)
download.file(
  "https://dl.dropboxusercontent.com/s/p69m0vk2fh4xe3o/england-regions-2011.zip",
  "er", method = "wget")
unzip("er")
er <- readOGR("england-regions-2011", "England_gor_2011")
proj4string(er) <- CRS("+init=epsg:27700")
You should be left with two objects, er (English regions) and ec (English counties). Both are SpatialPolygonsDataFrame objects.
Taking the first region - East of England E12000006 - let's clip the counties and turn the result back in to a SpatialPolygonsDataFrame object:
ee <- gIntersection(ec, er[er$CODE == "E12000006", ],
                    byid = T, drop_not_poly = T)
row.names(ee) <- as.character(gsub(" 0", "", row.names(ee)))
# gIntersection adds ' 0' to each row.name?
ee <- SpatialPolygonsDataFrame(ee, ec@data[row.names(ee), ])
A plot of ee confirms this worked.
As you can see, this is a nice workflow for just a few shapes, but I really want to loop through all regions and, ultimately, many more polygons.
What I've Tried
I'm not very good with apply() loops, so what I've tried so far is a for() loop (which I know is relatively slow, but still quicker than typing everything out!):
regions <- as.character(er$CODE) # length = 9 as expected
for (i in 1:length(regions)) {
  as.name(paste0(regions[i], "c")) <-
    gIntersection(ec, er[er$CODE == regions[1], ], byid = T, drop_not_poly = T)
}
Rather than the expected behaviour I get the following error:
Error in as.name(paste0(regions[1], "c")) <- gIntersection(ec, er[er$CODE == :
could not find function "as.name<-"
I also tried wrapping the object name in an eval() but get the following error:
Error in eval(as.name(paste0(regions[1], "c"))) <- gIntersection(ec, er[er$CODE == :
could not find function "eval<-"
What am I missing?
In addition to the gIntersection, I would like to re-create a SpatialPolygonsDataFrame object if possible. I've tried the following code, having done one gIntersection manually, but again it doesn't work:
for (i in 1:length(regions)) {
  row.names(as.name(paste0(regions[i], "c"))) <- as.character(gsub(" 0", "",
    row.names(as.name(paste0(regions[i], "c")))))
}
I get the following error:
Error in `rownames<-`(x, value) :
attempt to set 'rownames' on an object with no dimensions
I'm also not sure how to increment the " 0", as this increases by one for each new region (" 1", " 2", etc.)
Again, setting the first example up manually I also can't perform the final SpatialPolygonsDataFrame step:
for (i in 1:length(regions)) {
  as.name(regions[i]) <- SpatialPolygonsDataFrame(regions[i],
    ec@data[row.names(regions[i], )])
}
}
For this I get the following error:
Error in stopifnot(length(Sr#polygons) == nrow(data)) :
trying to get slot "polygons" from an object of a basic class ("character") with no
slots
Where I've looked
The following SO examples are related by do not seem to help, or at least I can't see how I would make them apply to this example:
rgeos gIntersection in loop takes too long to clip path network
How to clip WorldMap with polygon in R?
https://gis.stackexchange.com/questions/33278/no-intersection-found-between-polygons-i-know-intersect
Thanks for taking the time to read this.
Does this help?
ee <- lapply(regions, function(x)
  gIntersection(ec, er[er$CODE == x, ], byid = TRUE, drop_not_poly = TRUE))
This gives you a list of SpatialPolygons objects, one for each region, which you can access in the usual way, e.g.
ee[[1]]
plot(ee[[1]]) # to plot the first region with counties
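As to why the original attempt fails: as.name() only builds a symbol, so it cannot appear on the left-hand side of <- (hence the "could not find function as.name<-" error). The base-R way to create a variable from a computed name would be assign(), e.g.
assign(paste0(regions[i], "c"),
       gIntersection(ec, er[er$CODE == regions[i], ], byid = TRUE, drop_not_poly = TRUE))
with get(paste0(regions[i], "c")) to read the object back, but collecting the results in a list, as above, is usually the cleaner approach.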
Edit
Your original code should work with a slight modification (see below).
res <- list()
for (i in 1:length(regions)) {
  ee <- gIntersection(ec, er[er$CODE == regions[i], ],
                      byid = TRUE, drop_not_poly = TRUE)
  row.names(ee) <- as.character(gsub(paste0(" ", i-1), "", row.names(ee)))
  ee <- SpatialPolygonsDataFrame(ee, ec@data[row.names(ee), ])
  res[[i]] <- ee
}
If that solves the problem, then the issue was that the row names of ee always incremented by one and you did not account for this.
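The same idea can be wrapped in lapply; this is just a sketch, assuming the row-name pattern described above (a space plus an index appended to each county row name), so the per-region counter does not have to be tracked by hand:
res <- lapply(seq_along(regions), function(i) {
  ee <- gIntersection(ec, er[er$CODE == regions[i], ],
                      byid = TRUE, drop_not_poly = TRUE)
  # strip the appended " <index>" to recover the original county row names
  row.names(ee) <- sapply(strsplit(row.names(ee), " "), `[`, 1)
  SpatialPolygonsDataFrame(ee, ec@data[row.names(ee), ])
})
names(res) <- regions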

Merging multiple rasters in R

I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I have so far (the object 'ast14dmo.sd' is a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
  if (i == 1) {
    # Merge current with subsequent scene
    ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
  } else if (i > 1 && i < length(ast14dmo.sd)) {
    tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
    ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
  } else {
    # Save merged image
    writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""),
                format = "GTiff", overwrite = TRUE)
  }
}
As you surely guess, the code works. However, merging takes quite long, considering that each single raster object is some 70 MB in size. I also tried Reduce and do.call, but those failed since I couldn't pass the argument 'tolerance', which accounts for the different origins of the raster files.
Anybody got an idea of how to speed things up?
You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)
The merge function from the raster package is a little slow. For large projects a faster option is to work with GDAL commands in R.
library(gdalUtils)
library(rgdal)
Build a list of all the raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this as a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile = all_my_rasts, dst_dataset = "MyBigNastyRasty.tif", of = "GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.
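A sketch of that VRT route, assuming gdalUtils and the same vector of file names as above; a VRT is only a lightweight index of the tiles, so it builds quickly and the heavy work happens in the translate step:
library(gdalUtils)
gdalbuildvrt(gdalfile = all_my_rasts, output.vrt = "mosaic.vrt")
gdal_translate(src_dataset = "mosaic.vrt", dst_dataset = "MyBigMosaic.tif", of = "GTiff")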
You can use Reduce like this, for example:
Reduce(function(...) merge(..., tolerance = 1), ast14dmo.sd)
SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to SAGA .sgrd format first, then run the command line saga_cmd.
I have tested the solution using gdalUtils, as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking the documentation of the mosaic_rasters function here, I found that it works without making a template raster before mosaicking the images. I pasted the example code from the documentation below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if (require(raster) && require(rgdal) && valid_install) {
  layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package = "gdalUtils")
  layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package = "gdalUtils")
  mosaic_rasters(gdalfile = c(layer1, layer2),
                 dst_dataset = file.path(outdir, "test_mosaic.envi"),
                 separate = TRUE, of = "ENVI", verbose = TRUE)
  gdalinfo("test_mosaic.envi")
}
I was faced with this same problem and I used
#Read desired files into R
data_name1<-'file_name1.tif'
r1=raster(data_name1)
data_name2<-'file_name2.tif'
r2=raster(data_name2)
#Merge files
new_data <- raster::merge(r1, r2)
Although this did not produce a new merged raster file on disk, the merged object was stored in the R environment and produced a merged map when plotted.
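If a file on disk is wanted as well, writeRaster can save the merged object (the file name here is just an example):
writeRaster(new_data, filename = "merged_output.tif", format = "GTiff", overwrite = TRUE)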
I ran into the following problem when trying to mosaic several rasters on top of each other:
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
  number of items to replace is not a multiple of replacement length
As @Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first:
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
            filename = file.path("Mosaic_raster.tif"),
            format = "GTiff",
            overwrite = TRUE)

How would you speed up this code? Reducing the resolution of data from a netcdf and then converting it to xyz format for stats

I'm taking a NetCDF of maize yields and area harvested, shrinking the resolution from 2.5 arc-minutes to 0.5 degrees, and then converting the whole thing to XYZ format so that I can make it "talk" more easily to data that I've got in that format. (I suppose I could turn my other data into matrix form, but I like XYZ.)
The data is here.
The code below defines a few functions to calculate total production from area harvested and average yields, then it makes some "feeder" data to use when querying the NetCDFs, and then it uses plyr to loop through the feeder, extract the data, apply the functions, and save in XYZ format. It works, but it takes about 30 minutes to run just one of these files, and I've got more than 100. Any suggestions for ways to optimize this code would be great. Would it be faster to extract bigger chunks of data and apply functions to them? Like, maybe a whole stripe of the earth? How would I know a priori whether this would be faster or not?
rm(list=ls())
library(ncdf)
library(reshape)
library(plyr)
library(sp)
library(maps)
library(rgeos)
library(maptools)
library(rworldmap)
getha = function(lat, size = lat[1] - lat[2]) {
  lat1 = (lat - size/2) * pi/180
  lat2 = (lat + size/2) * pi/180
  lon1 = (0 - size/2) * pi/180 # lon doesn't come in because all longitudes are great circles
  lon2 = (0 + size/2) * pi/180
  6371^2 * abs(sin(lat1) - sin(lat2)) * abs(lon1 - lon2) * 100 # 6371 is the radius of the earth (km) and 100 is the number of ha in a km^2
}
gethamat = function(mat, latvec, blocksize = 6) {
  a = getha(latvec)
  areamat = matrix(rep(a, blocksize), blocksize)
  area = t(mat) * areamat # The matrix is transposed because the dimensions of Ramankutty's netcdf's are switched
  area
}
getprod = function(yieldblock, areablock, latvec) {
  b = gethamat(areablock, latvec)
  sum(t(yieldblock) * b, na.rm = TRUE)
}
lat = as.matrix(seq(from=89.75,to=-89.75,by=-.5))
lon = as.matrix(seq(from=-179.75,to=179.75,by=.5))
lon = seq.int(from=1,to=4320,by=6)
lat = seq.int(from=1,to=2160,by=6)
lat = rep(lat,720)
lon = t(matrix(lon,720,360))
lon = as.data.frame(lon)
l = reshape(lon,direction="long",varying=list(colnames(lon)),v.names = "V")
coords = data.frame(cbind(l[,2],lat))
colnames(coords) = c("lng","lat")
feeder = coords
head(feeder)
maize.nc = open.ncdf('maize_5min.nc')
getcrops = function(feed, netcdf, var = "cropdata") {
  yieldblock = get.var.ncdf(netcdf, varid = var,
    start = c(as.numeric(feed[1]), as.numeric(feed[2]), 2, 1), count = c(6, 6, 1, 1))
  areablock = get.var.ncdf(netcdf, varid = var,
    start = c(as.numeric(feed[1]), as.numeric(feed[2]), 1, 1), count = c(6, 6, 1, 1))
  lat = get.var.ncdf(netcdf, varid = "latitude", start = feed[2], count = 6)
  prod = getprod(yieldblock, areablock, lat)
  lon = get.var.ncdf(netcdf, varid = "longitude", start = feed[1], count = 6)
  # print(c(mean(lat), mean(lon)))
  data.frame(lat = mean(lat), lon = mean(lon), prod = prod)
}
out = adply(as.matrix(feeder),1,getcrops,netcdf=maize.nc,.parallel=FALSE)
Thanks in advance.
plyr functions are notoriously slow when the number of chunks grows large. I would really recommend keeping the data in a multi-dimensional array. This allows you to use apply to, e.g., get the mean for all lat-lon combinations. The multi-dimensional array takes less RAM, as the metadata is not stored directly as columns but implicitly within the dimensions of the array. In addition, apply is often much, much faster than plyr. The ncdf package natively reads the data into multi-dimensional arrays, so this also saves you a processing step (e.g. using melt).
After reducing the dataset, I would use melt to go to what you call XYZ format for plotting. But by then the dataset is so small that this does not really matter.
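A rough sketch of that whole-array route, assuming the same file, variable name, and dimension order as in the question's start/count calls, and showing only the 6 x 6 block aggregation; the per-cell hectare weighting from getha() would still need to be applied before summing production:
library(ncdf)
nc    <- open.ncdf("maize_5min.nc")
yield <- get.var.ncdf(nc, "cropdata", start = c(1, 1, 2, 1), count = c(4320, 2160, 1, 1))
area  <- get.var.ncdf(nc, "cropdata", start = c(1, 1, 1, 1), count = c(4320, 2160, 1, 1))
blocksum <- function(m, n = 6) {
  # sum an nr x nc matrix over n x n blocks via two grouped row sums
  m[is.na(m)] <- 0
  rowgrp <- rep(seq_len(nrow(m) / n), each = n)
  colgrp <- rep(seq_len(ncol(m) / n), each = n)
  t(rowsum(t(rowsum(m, rowgrp)), colgrp))
}
prod_blocks <- blocksum(yield * area)   # 720 x 360 matrix of block totals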
