Rasterize() slow on large SpatialPolygonsDataFrame, alternatives? - raster

I have a large (266,000 elements, 1.7Gb) SpatialPolygonsDataFrame that I am try to convert into 90m resolution RasterLayer (~100,000,000 cells)
The SpatialPolygonsDataFrame has 12 variables of interest to me, thus I intend to make 12 RasterLayers
At the moment, using rasterize(), each conversion takes ~2 days. So nearly a month expected for total processing time.
Can anyone suggest a faster process? I think this would be ~10-40x faster in ArcMap, but I want to do it in R to keep things consistent, and it's a fun challenge!
general code
######################################################
### Make Rasters
######################################################
##Make template
r<-raster(res=90,extent(polys_final))
##set up loop
loop_name <- colnames(as.data.frame(polys_final))
for(i in 1:length(loop_name)){
a <-rasterize(polys_final, r, field=i)
writeRaster(a, filename=paste("/Users/PhD_Soils_raster_90m/",loop_name[i],".tif",sep=""), format="GTiff")
}

I think this is a case for using GDAL, specifically the gdal_rasterize function.
You probably already have GDAL installed on your machine if you are doing a lot of spatial stuff, and you can run GDAL commands from within R using the system() command. I didn't do any tests or anything but this should be be MUCH faster than using the raster package in R.
For example, the code below creates a raster from a shapefile of rivers. This code creates an output file with a 1 value wherever a feature exists, and a 0 where no feature exists.
path_2_gdal_function <- "/Library/Frameworks/GDAL.framework/Programs/gdal_rasterize"
outRaster <- "/Users/me/Desktop/rasterized.tiff"
inVector <- "/Full/Path/To/file.shp"
theCommand <- sprintf("%s -burn 1 -a_nodata 0 -ts 1000 1000 %s %s", path_2_gdal_function, inVector, outRaster)
system(theCommand)
the -ts argument provides the size of the output raster in pixels
the -burn argument specifies what value to put in the output raster where the features exist
-a_nodata indicates which value to put where no features are found
For your case, you will want to add in the -a attribute_name argument, which specifies the name of the attribute in the input vector to be burned into the output raster. Full details on possible arguments here.
Note: that the sprintf() function is just used to format the text string that is passted to the command line using the system() function

Related

R terra function classify create very large files

I have an habitat classification map from Iceland (https://vistgerdakort.ni.is/) with 72 classes in a tif file of 5m*5m pixel size. I want to simplify it, so that there is only 14 classes. I open the files (a tif file and a text file containing the reclassification rules) and use the function classify in the terra package as follow on a subset of the map.
raster <- rast("habitat_subset.tif")
reclass_table<-read.table("reclass_habitat.txt")
habitat_simple<-classify(raster, reclass_table, othersNA=TRUE)
It does exactly what I need it to do and I am able to save the file back to tif using
writeRaster(habitat_simple, "reclass_hab.tif")
The problem is that my initial tif file was 105MB and my new reclassify tif file is 420MB. Since my goal is to reclassify the whole extent of the country, I can't afford to have the file become so big. Any insights on how to make it smaller? I could not find any comments online in relation to this issue.
You can specify the datatype, in your case you should be able to use "INT1U" (i.e., byte values between 0 and 254 --- 255 is used for NA, at least that is the default). That should give a file that is 4 times smaller than when you write it with the default "FLT4S". Based on your question, the original data come with that datatype. In addition you could use compression; I am not sure how well they work with "INT1U". You could have found out about this in the documentation, see ?writeRaster
writeRaster(habitat_simple, "reclass_hab.tif",
wopt=list(datatype="INT1U", gdal="COMPRESS=LZW"))
You could also skip the writeRaster step and do (with terra >= 1.1-4) you can just do
habitat_simple <- classify(raster, reclass_table, othersNA=TRUE,
datatype="INT1U", gdal="COMPRESS=LZW")

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")
cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)
m <- foreach(col=ncol(grid)) %:% foreach(row=nrow(grid)) %dopar% {
# get extent of subset
cell <- raster::cellFromRowCol(grid, row, col)
ext <- raster::extentFromCells(grid, cell)
# crop main raster to subset extent
subset <- raster::crop(data, ext)
# ...
# perform some processing steps on the raster subset
# ...
# save results to a separate file
saveRDS(subset, paste0("output_folder/", row, "_", col)
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file everytime it is called. This seems to be standard behavior of the raster package, but it becomes a problem, because these temp files are only deleted after the whole code has been executed, and take up way too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset#file#name). However, this does not work anymore when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this removeTmpFiles.
You should be able to use f <- filename(subset), avoid reading from slots (#). I do not see why you would not be able to remove it. But perhaps it needs some fiddling with the path?
temp files are only created when the raster package deems it necessary, based on RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory)
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care of not overwriting data from different tasks, so you may need to use some unique id associated with it.
saveRDS( ) won't work if the raster is backed up by a tempfile (as it will disappear).

Read in a list of shapefiles and row bind them in R (preferably using tidy syntax and sf)

I have a directory with a bunch of shapefiles for 50 cities (and will accumulate more). They are divided into three groups: cities' political boundaries (CityA_CD.shp, CityB_CD.shp, etc.), neighborhoods (CityA_Neighborhoods.shp, CityB_Neighborhoods.shp, etc.), and Census blocks (CityA_blocks.shp, CityB_blocks.shp, etc.). They use common file-naming syntaxes, have the same set of attribute variables, and are all in the same CRS. (I transformed all of them as such using QGIS.) I need to write a list of each group of files (political boundaries, neighborhoods, blocks) to read as sf objects and then bind the rows to create one large sf object for each group. However I am running into consistent problems developing this workflow in R.
library(tidyverse)
library(sf)
library(mapedit)
# This first line succeeds in creating a character string of the files that match the regex pattern.
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# This second line creates a list object from the files.
shapefile_list <- lapply(filenames, st_read)
# This third line (adopted from https://github.com/r-spatial/sf/issues/798) fails as follows.
districts <- mapedit:::combine_list_of_sf(shapefile_list)
Error: Column `District_I` cant be converted from character to numeric
# This fourth line fails in an apparently different way (also adopted from https://github.com/r-spatial/sf/issues/798).
districts <- do.call(what = sf:::rbind.sf, args = shapefile_list)
Error in CPL_get_z_range(obj, 2) : z error - expecting three columns;
The first error appears to be indicating that one of my shapefiles has an incorrect variable class for the common variable District_I but R provides no information to clue me into which file is causing the error.
The second error seems to be looking for a z coordinate but is only finding x and y in the geometry attribute.
I have four questions on this front:
How can I have R identify which list item it is attempting to read and bind is causing an error that halts the process?
How can I force R to ignore the incompatibility issue and coerce the variable class to character so that I can deal with the variable inconsistency (if that's what it is) in R?
How can I drop a variable entirely from the read sf objects that is causing an error (i.e. omit District_I for all read_sf calls in the process)?
More generally, what is going on and how can I solve the second error?
Thanks all as always for your help.
P.S.: I know this post isn't "reproducible" in the desired way, but I'm not sure how to make it so besides copying the contents of all my shapefiles. If I'm mistaken on this point, I'd gladly accept any wisdom on this front.
UPDATE:
I've run
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
shapefile_list <- lapply(filenames, st_read)
districts <- mapedit:::combine_list_of_sf(shapefile_list)
successfully on a subset of three of the shapefiles. So I've confirmed that there is some class conflict between the column District_I in one of the files causing the hold-up when running the code on the full batch. But again, I need the error to identify the file name causing the issue so I can fix it in the file OR need the code to coerce District_I to character in all files (which is the class I want that variable to be in anyway).
A note, particularly regarding Pablo's recommendation:
districts <- do.call(what = dplyr::rbind_all, shapefile_list)
results in an error
Error in (function (x, id = NULL) : unused argument
followed by a long string of digits and coordinates. So,
mapedit:::combine_list_of_sf(shapefile_list)
is definitely the mechanism to read from the list and merge the files, but I still need a way to diagnose the source of the column incompatibility error across shapefiles.
So after much fretting and some great guidance from Pablo (and his link to https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076), the following works:
library(tidyverse)
library(sf)
# Reads in all shapefiles from Directory that include the string "_CDs".
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# Applies the function st_read from the sf package to each file saved as a character string to transform the file list to a list object.
shapefile_list <- lapply(filenames, st_read)
# Creates a function that transforms a problem variable to class character for all shapefile reads.
my_func <- function(data, my_col){
my_col <- enexpr(my_col)
output <- data %>%
mutate(!!my_col := as.character(!!my_col))
}
# Applies the new function to our list of shapefiles and specifies "District_I" as our problem variable.
districts <- map_dfr(shapefile_list, ~my_func(.x, District_I))

Select columns when reading in files with st_read

I am trying to read 39 json files into a common sf dataset in R.
Here is the method I've been trying:
path <- "~/directory"
file.names <- as.list(dir(path, pattern='.json', full.names=T))
geodata <- do.call(rbind, lapply(file.names, st_read))
The problem is in the last line: rbind cannot work because the files have different numbers of columns. However, they all have three columns in common, and which I care about: MOVEMENT_ID, DISPLAY_NAME and geometry. How could I select only these three columns when running st_read?
I've tried running geodata<-do.call(rbind, lapply(file.names, st_read,select=c('MOVEMENT_ID', 'DISPLAY_NAME', 'geometry'))) but, in this case, st_read does not seem to recognise the geometry column (error: 'no simple features geometry column pressent').
I've also tried to use fread in place of st_read but this doesn't work as fread is not adapted to spatial data.
Run lapply over a function that calls st_read and then does what you need to it, something like:
read_my_json = function(f){
s = st_read(f)
return(s[,c("MOVEMENT_ID","DISPLAY_NAME")]
}
(I'm pretty sure you don't have to select the geometry as well, you get that for free when selecting columns of an sf spatial object)
then do.call(rbind, lapply(file.names, read_my_json)) should work.
no extra packages need to be included and it has the big advantage in that you can test this function to see how it works on a single item before throwing a thousand at it.

raster images stacked recursively

I have the following problem, please.
I need to read recursively raster images, stack and store them in a file with different names (e.g. name1.tiff, name2.tiff, ...)
I tried the following:
for (i in 10) {
fn <- system.file ("external / test.grd", package = "raster")
fn <-stack (fn) # not sure if this idea can work.
fnSTACK[,, i] <-fn
}
here expect a result of the form:
dim (fnSTACK)
[1] 115 80 10
or something like that
but it didn't work.
Actually, I have around 300 images that I have to be store with different names.
The purpose is to extract time series information (if you know another method or suggestions I would appreciate it)
Any suggestions are welcomed. Thank you in advance for your time.
What I would first do is put all your *.tiff in a single folder. Then read all their names into a list. Stack them and then write a multi-layered raster. I'm assuming all the images have the same extent and projection.
### Load necessary packages
library(tiff)
library(raster)
library(sp)
library(rgdal) #I cant recall what packages you might need so this is probably
library(grid) # overkill
library(car)
############ function extracts the last n characters from a string
############ without counting the last m
subs <- function(x, n=1,m=0){
substr(x, nchar(x)-n-m+1, nchar(x)-m)
}
setwd("your working directory path") # you set your wd to were all your images are
filez <- list.files() # creates a list with all the files in the wd
no <- length(filez) # amount of files found
imagestack <- stack() # you initialize your raster stack
for (i in 1:no){
if (subs(filez[i],4)=="tiff"){
image <- raster(filez[i]) # fill up raster stack with only the tiffs
imagestack <- addLayer(imagestack,image)
}
}
writeRaster(imagestack,filename="output path",options="INTERLEAVE=BAND",overwrite=TRUE)
# write stack
I did not try this, but it should work.
Your question is rather vague and it would have helped if you had provided a full example script such that it could be more easily understood. You say you need to read several (probably not recursively?) raster images (files, presumably) and create a stack. Then you need to store them in files with different names. That sounds like copying the files to new files with a different names, and there are R functions for that, but that is probably not what you intended to ask.
if you have a bunch of files (with full path names or in the working directory), e.g. from list.files()
f <- system.file ("external/test.grd", package = "raster")
ff <- rep(f, 10)
you can do
library(raster)
s <- stack(ff)
I am assuming that you simply need this stack for operations in R (it is an object, but not a file). You can extract the values in many ways (see the help files and vignette of the raster package). If you want a three dimensional array, you can do
a <- as.array(s)
dim(a)
[1] 115 80 10
thanks "JEquihua" for your suggestion, just need to add the initial variable before addLayer ie:
for (i in 1:no){
if (subs(filez[i],4)=="tiff"){
image <- raster(filez[i]) # fill up raster stack with only the tiffs
imagestack <- addLayer(imagestack,image)
}
}
And sorry "RobertH", I'm newbie about R. I will be ask, more sure or exact by next time.
Also, any suggestions for extracting data from time series of MODIS images stacked. Or examples of libraries: "rts ()", "ndvits ()" or "bfast ()"
Greetings to the entire community.
Another method for stacking
library(raster)
list<-list.files("/PATH/of/DATA/",pattern="NDVI",
recursive=T,full.names=T)
data_stack<-stack(list)

Resources