random sampling from a large raster using clusterR - r

I need to take a 5% random sample from a very large raster and return a new raster. I am trying to use sampleRandom from the raster package, but the process is very slow (I only have 8GB RAM on my machine, running 64-bit R). The raster has been cropped/masked to match an irregularly shaped study area boundary, as well - so has NA values in the rectangular extent around the polygon boundary and some internal NA values - I'm trying to sample only from the non-NA values. I've tried both sampling 5% and reversing that to sampling 95% - both ran for >2 hours without producing a result, at which point I terminated the process.
I am trying to speed it up by running it in parallel using the clusterR command, but I'm new to both the sampleRandom command and to using clusterR. My code runs, but I get all of the non-NA pixels returned, so the sample doesn't seem to be working. Is this a problem with my code, or is it that sampleRandom can't run with clusterR?
Here is a description of my raster layer:
class : RasterLayer
dimensions : 23828, 19095, 454995660 (nrow, ncol, ncell)
resolution : 56, 56 (x, y)
extent : -1220192, -150872, 87580, 1421948 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=44.75 +lat_2=55.75 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
data source : C:\GIS\carbon_cows\Intact\conv_mod.tif
names : conv_mod
values : 1, 1 (min, max)
And here is the code I have tried:
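library(raster)
library(parallel) # for detectCores()
tot <- cellStats(conv.mod, 'sum', na.rm=TRUE) # get the total pixels in conv.mod (every non-NA cell has value 1)
sampsize <- tot * 0.05 # calculate how many pixels would represent 5%
removeTmpFiles() # delete old raster temporary files (frees disk space, not RAM)
numcores <- detectCores() - 1
beginCluster(numcores) # start the cluster that clusterR() uses
cl <- getCluster() # borrow the cluster so that sampsize can be exported
clusterExport(cl, "sampsize", envir = .GlobalEnv)
returnCluster() # hand the cluster back before calling clusterR()
conv.perc <- clusterR(conv.mod, sampleRandom, args=list(size=sampsize, na.rm=TRUE, asRaster=TRUE))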
Here are the total non-NA cells in the original raster layer:
tot<-cellStats(conv.mod,'sum', na.rm=TRUE)
and the number that should be a 5% sample:
sampsize<-tot * 0.05
But, the resulting raster has the same number of non-NA pixels as the original raster:
I've also tried reversing the sample size calculation and running sampleRandom, so that I'm requesting a 95% sample. But, I get the same result.
I'd appreciate any help in understanding why this code is not running as expected. Thanks!

Never mind. I was able to take advantage of this post: https://gis.stackexchange.com/questions/17255/random-sampling-of-raster-using-r and the reply by whuber.
The following code solved my problem, without the use of a cluster:
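col.conv <- ncol(conv.mod)
row.conv <- nrow(conv.mod)
r <- conv.mod # work on a copy of the raster
r[runif(col.conv*row.conv) >= 0.95] <- NA # Randomly *unselect* 5% of the data (keeps ~95%; use >= 0.05 to keep ~5%)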
That code ran in ~3 minutes, as opposed to over an hour for putting the sampleRandom code in the clusterR command. I still wonder why sampleRandom was not actually taking a sample, and also why this new code is so much more efficient, but I'm happy to have the problem solved.
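For readers who want to see the masking idea in isolation, here is a minimal sketch in base R. It uses a plain matrix as a hypothetical stand-in for the raster (the names m, keep_prob and sampled are illustrative, not from the post). Each cell is kept independently with probability keep_prob, so the result is approximately, not exactly, a 5% sample of the non-NA cells.

```r
set.seed(42)

# Hypothetical stand-in for the masked raster: all 1s, with an NA band
m <- matrix(1, nrow = 200, ncol = 300)
m[1:20, ] <- NA

keep_prob <- 0.05                       # fraction of cells to keep
keep <- runif(length(m)) < keep_prob    # one uniform draw per cell

sampled <- m
sampled[!keep] <- NA                    # blank every cell that was not drawn

n_valid <- sum(!is.na(m))               # non-NA cells before sampling
n_kept  <- sum(!is.na(sampled))         # non-NA cells after sampling
frac    <- n_kept / n_valid
print(frac)
```

Cells that were NA to begin with stay NA, so the draw only ever thins the valid area. If an exact sample size is required, a different approach (for example sampling cell indices without replacement) would be needed.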


Using R to obtain slope raster from DEM GRID raster

After some extensive googling, I wasn't able to find an answer (this is the first time I couldn't get past an issue by looking at others' questions and answers). I am new to asking questions, so forgive any missteps.
I am attempting to perform what ArcGIS or QGIS performs with the slope tool, just within R. To do so I have been importing a raster that I exported from ArcGIS in GRID format with the following characteristics:
class : RasterLayer
dimensions : 821, 581, 477001 (nrow, ncol, ncell)
resolution : 4.996121, 4.996121 (x, y)
extent : 2832147, 2835049, 14234048, 14238150 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=34.75 +lon_0=-118.583333333333 +k=0.9999 +x_0=800000.000000001 +y_0=3999999.99999999 +datum=NAD83 +units=us-ft +no_defs
source : rr_2020_shell
names : rr_2020_shell
values : 5623.253, 6401.356 (min, max)
It is already projected in the correct coordinate system (EPSG: 3423) but when I go to find the slope using the following code:
RR_2020_Slope = terrain(RR_2020_St1_Raster,'slope', units = 'degrees', neighbors = 8, filename = 'RR_2020_Slope.grd', overwrite = T)
The result is a slope raster that ranges from 0 to 1.28°, which is very different from what I have calculated in ArcGIS using the slope tool. Using the same DEM raster in the same projection in ArcGIS I used the slope tool with an input of 'Degree' for the output measurement, 'Planar' for the method, and 1 for Z factor and my resulting slope raster ranges from 0.001 to 73.396°.
Overall, I am wondering where my mistake in R originates. Is it an elevation resolution problem? Are there issues with my projection? Forgive me, I can't include the data as they are sensitive, but perhaps there is a clear and obvious mistake in my approach or in my assumptions about the functions I have used?
The only red flag I see is that you say "it is already projected in the correct coordinate system". Projecting raster data degrades the quality. As cell values get smoothed, the slopes will get smaller. This may be particularly pronounced if the relief is at the scale of the cell size (e.g. sand dunes vs mountain chains). Have you compared with what you get with the original data?
Another source of error could be that the units of the values are different from the units of the coordinate reference system. But it would appear that in your case both are in feet.
Can you also try this with terra::terrain()?
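One further check that needs no data, offered as an observation rather than a confirmed diagnosis: the two reported maxima differ by almost exactly the factor between radians and degrees. In the raster package the argument of terrain() is named unit (default 'radians'), so a call that passes units = 'degrees' may leave the output in radians. The arithmetic below only uses the two maxima quoted in the question; the variable names are illustrative.

```r
# Maximum slope reported from R (suspected to be in radians)
r_max <- 1.28
# Maximum slope reported from ArcGIS, in degrees
arcgis_max_deg <- 73.396

# Convert each value into the other unit
r_max_as_deg      <- r_max * 180 / pi
arcgis_max_as_rad <- arcgis_max_deg * pi / 180

print(c(r_max_as_deg = r_max_as_deg, arcgis_max_as_rad = arcgis_max_as_rad))
```

If the two converted values line up with the other tool's output, rerunning terrain() with unit = 'degrees' would be a cheap way to test this.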

Improving computational speed of Zonal Statistics on 150gb+ of raster tiles in R (terra package and vrt)

I'm working in R - I have a directory of raster (.tif) tiles at 60cm resolution, downloaded from Google Earth Engine (NAIP 2018 NDVI). (I am running my analysis on pc rather than in Google Earth Engine due to human subjects requirements of my polygon data.) The 52 tiles are each 1.2-3.8GB in size. I also have 982 polygons, for which I'm trying to calculate the zonal means from these rasters. My code (below) uses the terra package, and instead of mosaic-ing the tiles into a very large singular raster, I've chosen to create a VRT (virtual raster) file.
I am running this code on a Xeon Gold 6134 @ 3.2 GHz with 128 GB of RAM. No matter what I set my terraOptions() to, R doesn't come close to using a significant proportion of my processor or RAM.
With this code, all 982 polygons will take 11.8 days to run. I would GREATLY appreciate it if anyone could point me to specific tricks/tools that I may not have already tried. (I've tried working with all the terraOptions, and I've tried the raster package and the exactextractr package. The exact_extract() function won't work for me, as I am using a SpatRaster/VRT and an sf polygon object as inputs, again to avoid mosaicking a very large single raster.)
Thank you. (I apologize that I cannot share data, as it's either too large or human subjects related...) Here is the un-looped code:
Edit: 52 tiles of 1.2-3.8GB EACH. My original quote of 150GB total directory size was incorrect as this was the compressed size in ArcGIS.
library(terra)
c <- "path/to directory of raster tiles"
v <- "path/new.vrt" # name of virtual raster
ras_lst <- list.files(c, full.names=TRUE, pattern="\\.tif$") # only files ending in .tif
terra::vrt(ras_lst, v, overwrite = TRUE)
ras <- rast(v)
w <- vect("path to polygon shapefile")
w2 <- terra::project(w, terra::crs(ras)) # transform proj to same as raster tiles
e2 <- terra::extract(ras, w2, fun="mean")
e2 # zonal mean value for 1 polygon (of 982)
show(ras) produces:
class : SpatRaster
dimensions : 212556, 247793, 1 (nrow, ncol, nlyr)
resolution : 5.389892e-06, 5.389892e-06 (x, y)
extent : -118.964, -117.6284, 33.68984, 34.8355 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs
source : naip2018mos.vrt
name : naip2018mos

R- Raster math while preserving integer format

I have some large rasters (~110 MB each) that I want to perform some raster calculations on. For the purposes of this example, I want to average the files SNDPPT_M_sl1_1km_ll.tif and SNDPPT_M_sl2_1km_ll.tif, available at this website. In reality, the math is a bit more complex (some multiplication and division of several rasters).
Both input rasters are integer (INT1U) data, and I would like the output to also be INT1U. However, whenever I try to do a raster calculation, it creates intermediate temporary files in floating point format which are very large in size. I am working on a laptop with about 7 GB of free hard drive space, which gets filled before the calculation is complete.
# load packages
library(raster)
## script control
# which property?
prop <- "SNDPPT"
# load layers
r.1 <- raster(paste0("1raw/", prop, "_M_sl1_1km_ll.tif"))
r.2 <- raster(paste0("1raw/", prop, "_M_sl2_1km_ll.tif"))
# allocate space for output raster - this is about 100 MB (same size as input files)
r.out <- writeRaster(r.1,
filename=paste0("2derived/", prop, "_M_meanTop200cm_1km_ll.tif"),
datatype="INT1U")
# perform raster math calculation
r.out <- as.integer(round((r.out+r.2)/2))
# at this point, my hard drive fills due to temporary files > 7 GB in size
Is anyone aware of a workaround to perform raster math in R with integer input and output files while minimizing or avoiding the very large intermediate files?
The trick here could be to use raster::overlay to make the computation and save the results as a compressed tiff at the same time. Something like this should work:
library(raster)
#> Loading required package: sp
# load layers
r.1 <- raster("C:/Users/LB_laptop/Downloads/SNDPPT_M_sl1_1km_ll.tif")
r.2 <- raster("C:/Users/LB_laptop/Downloads/SNDPPT_M_sl2_1km_ll.tif")
out <- raster::overlay(r.1, r.2,
fun = function(x, y) (round((x + y) / 2)),
filename = "C:/Users/LB_laptop/Downloads/SNDPPT_out.tif",
datatype = "INT1U",
options = "COMPRESS=DEFLATE") # compression option is an assumption; the original call was cut off after datatype
> out
class : RasterLayer
dimensions : 16800, 43200, 725760000 (nrow, ncol, ncell)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -180, 180, -56.00083, 83.99917 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
data source : C:\Users\LB_laptop\Downloads\SNDPPT_out.tif
names : SNDPPT_out
values : 0, 242 (min, max)
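A back-of-the-envelope calculation may help explain the disk usage. It assumes that intermediate results are written with the raster package's default FLT4S datatype (4 bytes per cell) and that sizes are uncompressed; neither is confirmed in the thread. The dimensions come from the printed output above; all variable names are illustrative.

```r
# Dimensions taken from the printed RasterLayer (nrow, ncol)
n_cells <- 16800 * 43200

# Bytes per cell: INT1U is 1 byte; FLT4S (assumed for temporary files) is 4 bytes
bytes_int1u <- n_cells * 1
bytes_flt4s <- n_cells * 4

gib <- function(bytes) bytes / 1024^3
cat(sprintf("INT1U: %.2f GiB, FLT4S: %.2f GiB\n",
            gib(bytes_int1u), gib(bytes_flt4s)))

# Averaging two byte-valued layers cannot leave the 0..255 range,
# so the rounded result still fits in INT1U
x <- c(0, 37, 255)
y <- c(255, 40, 255)
avg <- round((x + y) / 2)
print(avg)
```

If each step of the expression (sum, division, rounding) wrote its own floating-point temporary file, a few such files would be enough to exceed 7 GB, whereas a single function applied in one pass writes only the final one-byte layer.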

Raster stack incorrectly plotting latitude and longitude coordinates

I've downloaded precipitation data from the TRMM (rainfall across the tropics) satellite as a netCDF file and have been trying to plot the data in R as a rasterstack. However, R insists on plotting the latitude and longitude axes incorrectly, such that longitude is plotted on the x-axis (as it should be) but uses the latitude coordinates, while latitude is on the y-axis, but uses the longitude coordinates. I've tried using both the plot() and levelplot() functions but neither seems to work. Can anyone help me correct this?
These are the characteristics of the stack:
class : RasterStack
dimensions : 1440, 186, 267840, 12 (nrow, ncol, ncell, nlayers)
resolution : 0.25, 0.25 (x, y)
extent : -23.25, 23.25, -180, 180 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
names : X2016.01.16, X2016.02.15, X2016.03.16, X2016.04.15, X2016.05.16, X2016.06.15, X2016.07.16, X2016.08.16, X2016.09.15, X2016.10.16, X2016.11.15, X2016.12.16
Date : 2016-01-16, 2016-02-15, 2016-03-16, 2016-04-15, 2016-05-16, 2016-06-15, 2016-07-16, 2016-08-16, 2016-09-15, 2016-10-16, 2016-11-15, 2016-12-16
In the following image you can see the current output. It should show rainfall over the tropics from -23 to 23 degrees latitude, and -180 to 180 degrees longitude.
It's odd if the coordinates are switched prior to any processing. You may want to assess the source you downloaded the data from and check whether there is a better one.
Anyway, in the meantime the raster package can help, specifically the transpose function t(). Here's an example:
# data before transpose
x <- getData('worldclim',var='tmean',res=10)
# data after transpose
y <- t(x)
There are also a couple of other functions in raster which could be of interest for you: flip and rotate
Thanks for your response. It does seem strange that the coordinates are messed up right out of the box, and I did try downloading a fresh set of data, but the same problem occurred. However, thanks to your input, I was able to rectify the problem using the t() (transpose) and flip() functions. I had to transpose the data, then flip it along both the x and y dimensions, as the image was 'mirrored'. Here's the code I used in case anyone else encounters this problem with the TRMM data sets:
a.t = t(test.rasterstack) # transpose: swap rows and columns
a.t.flipy = flip(a.t, direction = 2) # flip along y
a.t.flipxy = flip(a.t.flipy, direction = 1) # flip along x
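To see what the transpose-then-flip sequence does to the cell layout, here is a small base R analogy on a plain matrix. It is only a sketch of the index manipulation: unlike the raster functions, it does not touch extents or coordinates, and the object names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)            # 2 rows x 3 columns
m_t <- t(m)                           # transpose: now 3 rows x 2 columns

# Flip along y: reverse the row order
m_flip_y <- m_t[nrow(m_t):1, , drop = FALSE]

# Flip along x: reverse the column order
m_flip_xy <- m_flip_y[, ncol(m_flip_y):1, drop = FALSE]

print(m)
print(m_flip_xy)
```

Flipping along both axes is the same as rotating the transposed grid by 180 degrees, which is why both flips were needed to undo the 'mirrored' appearance.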

Random sampling from large rasterlayer

I have a large RasterLayer with integers ranging from 0 to 44.
class : RasterLayer
dimensions : 29800, 34470, 1027206000 (nrow, ncol, ncell)
resolution : 10, 10 (x, y)
extent : 331300, 676000, 5681995, 5979995 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=32 +ellps=GRS80 +units=m +no_defs
data source : /home/mkoehler/stk_rast_whz
names : stk_rast_whz
values : 0, 44 (min, max)
I want to do a stratified sampling of 5000 points per stratum.
I get the following error:
POINTS<-sampleStratified(b, size=5000, na.rm=T, xy=F)
(Error in ys[[i]] <- y : attempt to select less than one element)
Here is code that reproduces the problem (even when only selecting 1 item per stratum):
r <- raster(ncol=5000, nrow=5000)
names(r) <- 'stratum'
r[] <- round((runif(ncell(r)))*44)
sampleStratified(r, size=1,xy=T)
Error in ys[[i]] <- y : attempt to select less than one element
Trying that with fewer strata and changing the settings of "size" or "exp" has no effect.
R version: [64-bit] C:\Program Files\R\R-3.1.1
Any ideas?
Thanks in advance!
This appears to be a bug (as at raster 2.3-12), and occurs when (1) your raster contains cells with value 0, and (2) the raster can't be processed in memory (i.e. canProcessInMemory(r) is FALSE).
The function loops over the unique cell values produced by freq(r), and then indexes a list by each of these values in turn. If one of those values is zero, the error will be triggered since the 0th element does not exist. For example:
list()[[0]]
# Error in list()[[0]] : attempt to select less than one element
You'll notice that the error doesn't occur if you fill r with, e.g., r[] <- sample(44, ncell(r), replace=TRUE), since it won't have any zeroes.
When the raster can be processed in memory, the function loops over the row numbers of freq(r), and so the subsequent list indexing is sensible.
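The difference between the two looping patterns can be reproduced in base R without the raster package. In this sketch, f is a hypothetical stand-in for the two-column table returned by freq(r) (value, count); the loops are simplified illustrations of the pattern described above, not the package's actual code.

```r
# Stand-in for freq(r): one row per unique cell value, including 0
f <- cbind(value = c(0, 1, 2), count = c(10, 20, 30))

# Pattern 1: index the list by cell value; fails as soon as the value is 0
ys <- list()
res <- tryCatch({
  for (i in f[, 1]) ys[[i]] <- i
  "ok"
}, error = function(e) conditionMessage(e))
print(res)

# Pattern 2: index the list by row number; works for any cell values
ys2 <- list()
for (i in seq_len(nrow(f))) ys2[[i]] <- f[i, 1]
print(length(ys2))
```

Looping over row numbers keeps the list index in 1..n regardless of which values occur in the raster, which is the same change the temporary fix makes.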
I've contacted the maintainer to report this bug.
Meanwhile, as a temporary fix, you could use something like the following to make a corrected copy of the function (which will remain available in the current R session).
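sampleStratified2 <-
eval(parse(text=sub('sr\\[, 2\\] == i', 'sr[, 2] == f[i, 1]',
sub('i in f\\[, 1\\]', 'i in seq_len(nrow(f))',
deparse(getMethod(sampleStratified, 'RasterLayer')@.Data))))) # innermost expression is an assumption; it was cut off in the source
sampleStratified2(r, size=1, xy=TRUE)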
