Why is the intersect function from the terra R package not giving all the combinations?

I want to calculate the area under every possible combination of two classified rasters. I am using the following code
library(terra)
#First create two rasters
r1 <- r2 <- rast(nrow=100, ncol=100)
#Assign random cell values
set.seed(123)
values(r1) <- runif(ncell(r1), min=0, max=1)
values(r2) <- runif(ncell(r2), min=0, max=1)
# classify the values into two groups
m_r1 <- c(min(global(r1, "min", na.rm=TRUE)), 0.2, 1,
0.2, max(global(r1, "max", na.rm=TRUE)), 2)
m_r2 <- c(min(global(r2, "min", na.rm=TRUE)), 0.2, 1,
0.2, max(global(r2, "max", na.rm=TRUE)), 2)
#Reclassify the rasters
rclmat_r1 <- matrix(m_r1, ncol=3, byrow=TRUE)
rc_r1 <- classify(r1, rclmat_r1, include.lowest=TRUE)
rclmat_r2 <- matrix(m_r2, ncol=3, byrow=TRUE)
rc_r2 <- classify(r2, rclmat_r2, include.lowest=TRUE)
plot(rc_r1)
plot(rc_r2)
#Convert to polygons
r1_poly <- as.polygons(rc_r1, dissolve=TRUE)
r2_poly <- as.polygons(rc_r2, dissolve=TRUE)
plot(r1_poly)
plot(r2_poly)
#Perform intersections
x <- intersect(r1_poly, r2_poly)
x
#> class : SpatVector
#> geometry : polygons
#> dimensions : 2747, 2 (geometries, attributes)
#> extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84
#> names : lyr.1 lyr.1
#> type : <int> <int>
#> values : 1 1
#> 1 2
#> 2 1
As you can see from the output, one combination, i.e. 2-2, is missing. Why is this happening?
When I try to calculate the area for each combination using expanse(x), it returns a long result. How can I get the area in km2 for the following combinations?
Combination Area (km2)
1-1
1-2
2-1
2-2

With this example it would be better to stay with raster data.
x = 10 * rc_r1 + rc_r2
a = cellSize(x, unit="km")
zonal(a, x, sum)
# lyr.1 area
#1 11 19886611
#2 12 81946082
#3 21 84763905
#4 22 323469024
By multiplying with 10, the values in the first layer become 10 (if they were 1) or 20 (if they were 2). If you then add the second layer, you get 10 + 1 or 10 + 2, and 20 + 1 or 20 + 2, so you end up with four classes: 11, 12, 21, and 22. These show the value in the first raster (first digit) and in the second raster (second digit).
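If you want that output in exactly the requested format, a small follow-up sketch (assuming the zonal result is stored in z and has the column names shown above, lyr.1 and area):
z <- zonal(a, x, sum)
# split the combined code back into its two digits, e.g. 12 -> "1-2"
z$combination <- paste(z$lyr.1 %/% 10, z$lyr.1 %% 10, sep="-")
z[, c("combination", "area")]
#  combination      area
#1         1-1  19886611
#2         1-2  81946082
#3         2-1  84763905
#4         2-2 323469024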
When you print a SpatVector, only the first three records are shown, and there is in fact a 2-2 record. Nevertheless, intersect did not work properly, and I have now fixed this.

Related

stack geotiff with stars 'along' when 'band' dimension contains band + time information

I have a timeseries of geotiff files I'd like to stack in R using stars. Here are the first two:
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
stars::read_stars(urls, along="time")
Errors with:
Error in c.stars_proxy(`3` = list(gep01.t00z.pgrb2a.0p50.f003.tif = "/vsicurl/https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/noaa/gefs-v12/cogs/gefs.20221201/gep01.t00z.pgrb2a.0p50.f003.tif"), :
don't know how to merge arrays: please specify parameter along
Context: bands contain both time+band info
This fails because the dimensions do not match, which happens because the files have concatenated temporal information into the band names:
x <- lapply(urls, read_stars)
x
produces:
[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f003.ti... 50026.01 98094.81 101138 98347.42 101845.2 104605.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:3 hour fcst,...,DLWRF:surface:0-3 hour ave fcst
[[2]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f006.ti... 50029.83 98101.83 101170.6 98337.52 101825 104588.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:6 hour fcst,...,DLWRF:surface:0-6 hour ave fcst
Note that the band names would align except for the timestamp being tacked on, e.g. PRES:surface:3 hour fcst vs PRES:surface:6 hour fcst.
How can I best read in these files so that I have dimensions of x,y,band, and time in my stars object?
alternatives: terra?
How about terra? Note that terra is happy to read these files in directly, but treats this as 16 unique bands. Can I re-align that so that I have the original 8 bands along a new "time" dimension? (I recognize stars emphasizes 'spatio-temporal'; maybe such a cube is out of scope for terra?) Also note that terra for some reason mangles the timestamp in these band names:
x <- terra::rast(urls)
x
class : SpatRaster
dimensions : 361, 720, 16 (nrow, ncol, nlyr)
resolution : 0.5, 0.5 (x, y)
extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat Coordinate System imported from GRIB file
sources : gep01.t00z.pgrb2a.0p50.f003.tif (8 layers)
gep01.t00z.pgrb2a.0p50.f006.tif (8 layers)
names : PRES:~ fcst, TMP:2~ fcst, RH:2 ~ fcst, UGRD:~ fcst, VGRD:~ fcst, APCP:~ fcst, .
With terra it is pretty easy to make a time-series for each variable as I show below.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
r <- rast(urls)
Extract two variables of interest
nms <- names(r)
tmp <- r[[grep("TMP", nms)]]
rh <- r[[grep("RH", nms)]]
# set time
tm <- as.POSIXct("2022-12-01", tz="GMT") + c(3,6) * 3600
time(rh) <- tm
time(tmp) <- tm
And you could combine them into a SpatRasterDataset like this:
s <- sds(list(tmp=tmp, rh=rh))
An alternative path to get to the same point would be to start with a SpatRasterDataset and subset it.
sd <- sds(urls)
nl <- 1:length(sd)
nms <- names(sd[1])
tmp2 <- rast(sd[nl, grep("TMP", nms)])
time(tmp2) <- tm
rh2 <- rast(sd[nl, grep("RH", nms)])
time(rh2) <- tm
I made the subsetting work a little nicer in terra version 1.7-5
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/", c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
#terra 1.7.5
sd <- sds(urls)
tmp <- sd[,2]
tmp
#class : SpatRaster
#dimensions : 361, 720, 2 (nrow, ncol, nlyr)
#resolution : 0.5, 0.5 (x, y)
#extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat Coordinate System imported from GRIB file
#sources : gep01.t00z.pgrb2a.0p50.f003.tif
# gep01.t00z.pgrb2a.0p50.f006.tif
#names : TMP:2 m above g~Temperature [C], TMP:2 m above g~Temperature [C]
#unit : C, C
#time : 2022-12-01 03:00:00 to 2022-12-01 06:00:00 UTC
As for the layer names containing the forecast time, that is just because that is what is in the tif metadata. It looks like that was a decision made when the files were created from the original GRIB files.
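If you want to verify that yourself, one option (a sketch) is to print the GDAL metadata of one of the files with describe(), which wraps gdalinfo:
# the band descriptions in the metadata already contain the forecast-hour text
terra::describe(urls[1])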
The latitude extent going beyond the north and south poles is an interesting feature of this dataset.
Just wanted to share some additional possible solutions for comparison. With larger numbers of files some of these differences become more relevant. This expands a bit beyond my original question.
terra
Prof Hijmans gives a very nice solution in terra. He also asked about the original upstream sources, which I didn't explain properly -- these are originally GRIB files for the NOAA GEFS forecast.
Notably, we can work directly from the GRIB files. GEFS is a 35-day forecast, so let's try going more than 6 hrs into the future:
library(terra)
# original GRIB sources, AWS mirror
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- terra::sds(gribs)
})
cube[1,63] |> plot()
very nice!
gdalcubes
gdalcubes is another package that can also leverage the GDAL virtual filesystem when working with these large-ish remote files. It also lets us define an abstract cube at potentially a different resolution in space & time than the original sources (averaging or interpolating). Lazy operations mean this may run a bit faster(?)
library(gdalcubes)
date <- as.Date("2023-01-26")
date_time = date + lubridate::hours(seq(3,240,by=3))
# USA box
v <- cube_view(srs = "EPSG:4326",
extent = list(left = -125, right = -66,top = 49, bottom = 25,
t0= as.character(min(date_time)), t1=as.character(max(date_time))),
dx = 0.5, dy = 0.5, dt = "PT3H")
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- gdalcubes::create_image_collection(gribs, date_time = date_time)
})
bench::bench_time({
raster_cube(cube, v) |>
select_bands("band63") |> # temperature
animate(col = viridisLite::viridis, nbreaks=50, fps=10, save_as = "temp.gif")
})
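From the same lazily evaluated cube you could also aggregate over time before plotting; a sketch, assuming the same cube, v, and the band63 temperature band as above:
# mean temperature over the whole forecast horizon, computed lazily
raster_cube(cube, v) |>
select_bands("band63") |>
reduce_time("mean(band63)") |>
plot(col = viridisLite::viridis, nbreaks = 50)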
stars
I didn't translate a full stars example, but here at least is the band-name correction; it is a bit more cumbersome than the examples above.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
#stars::read_stars(urls, along="time") # no luck!
## grab unstacked proxy object for each geotiff
x <- lapply(urls, read_stars)
# extract band-names-part
band_names <- st_get_dimension_values(x[[1]], "band") |>
stringr::str_extract("([A-Z]+):") |>
stringr::str_remove(":")
# apply corrected band-names
x1 <- lapply(x, st_set_dimensions, "band", band_names)
# at last, we can stack into a cube:
x1 <- do.call(c, c(x1, along="time"))
# and add correct date timestamps to the new time dimension
dates <- as.Date("2022-12-01") + lubridate::hours(c(3,6))
x1 <- st_set_dimensions(x1, "time", dates)
x1
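A quick sanity check on the stacked object (a sketch):
# the cube should now have x, y, band and time, with the cleaned band names
st_dimensions(x1)
st_get_dimension_values(x1, "band")
st_get_dimension_values(x1, "time")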

How to properly rasterize irregular data points on EASE2 grid?

I was given data with 3 columns (lon, lat, data) which I would like to turn into a SpatRaster.
library(terra)
#> terra 1.5.21
df <- read.csv("data.csv")
head(df)
#> lon lat mean_daily_par_mol
#> 1 -7.52993 61.91259 0.5463984
#> 2 -7.16531 61.77965 0.5635693
#> 3 -6.80426 61.64570 0.6404579
#> 4 -6.44676 61.51075 0.6602439
#> 5 -6.09277 61.37483 0.6801313
#> 6 -5.74228 61.23794 0.6358355
Unfortunately, the data is not on a regular grid, so I cannot create a raster directly. The data is on the MODIS ISIN grid.
rast(df)
#> Error: [raster,matrix(xyz)] x cell sizes are not regular
e <- ext(c(min(df$lon), max(df$lon), min(df$lat), max(df$lat)))
e
#> SpatExtent : -180, 179.69196, 60.00299, 89.41164 (xmin, xmax, ymin, ymax)
r <- rast(e, ncols = 100, nrows = 100)
r <- rasterize(
as.matrix(df[c("lon", "lat")]),
r,
df$mean_daily_par_mol,
fun = mean,
na.rm = TRUE
)
plot(r)
Obviously, changing ncols and nrows considerably influences how the raster is created.
r <- rast(e, ncols = 300, nrows = 300)
r <- rasterize(
as.matrix(df[c("lon", "lat")]),
r,
df$mean_daily_par_mol,
fun = mean,
na.rm = TRUE
)
plot(r)
My question is: how can I find the optimal ncols and nrows so that there is approximately 1 value per pixel? The ultimate goal is to reproject this data onto the EASE2 grid.
The data is available here
Created on 2022-05-29 by the reprex package (v2.0.1)
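One rough way to choose ncols and nrows so that there is about one point per cell (a back-of-the-envelope sketch, not an authoritative answer) is to match the number of cells to the number of points while keeping the aspect ratio of the extent:
# n points ~ nrows * ncols, and cell width ~ cell height (in degrees)
# implies ncols / nrows ~ xrange / yrange
n <- nrow(df)
xr <- diff(range(df$lon))
yr <- diff(range(df$lat))
ncols <- round(sqrt(n * xr / yr))
nrows <- round(n / ncols)
r <- rast(e, ncols = ncols, nrows = nrows)
Because the points come from the ISIN grid, their density still varies with latitude, so this only gives an approximately one-value-per-pixel raster before reprojecting to EASE2.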

Is it possible to write a single function to create raster files from a data.frame object?

I have loaded a data.frame object named "prec" in R, with 1009549 rows and 8 variables. I want to create 60 raster layers of the cumulative "prec" values for each x-y coordinate pair at every 4 time steps ("tstep" variable, from index 2 to 241), as summarized in the code below. I currently do this in 3 separate steps, creating each object individually. However, is it possible to write a single function for each step, or a single function for the entire code (steps 1 to 4)?
load required packages
library(data.table)
library(raster)
structure of the "prec" data.frame
> headTail(prec)
x y prec index tstep variable level date
1 -47.8 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.1 -47.6 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.2 -47.4 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.3 -47.2 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
... ... ... ... ... ... <NA> ... <NA>
241.4185 -36.8 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4186 -36.6 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4187 -36.4 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4188 -36.2 -7.2 0 241 241 prec 1000 2015-01-01 00:01:00
step 1: subset by tstep
prec_1 <- prec[prec$tstep %in% c(2, 3, 4, 5),]
prec_2 <- prec[prec$tstep %in% c(6, 7, 8, 9),]
prec_3 <- prec[prec$tstep %in% c(10, 11, 12, 13),]
...
prec_60 <- prec[prec$tstep %in% c( 238 , 239 , 240 , 241),]
step 2: coerce to data.table
prec_1_sum <- setDT(prec_1)[, list(prec_sum_1 = sum(prec*1000)), list(x, y)]
prec_2_sum <- setDT(prec_2)[, list(prec_sum_2 = sum(prec*1000)), list(x, y)]
prec_3_sum <- setDT(prec_3)[, list(prec_sum_3 = sum(prec*1000)), list(x, y)]
...
prec_60_sum <- setDT(prec_60)[, list(prec_sum_60 = sum(prec*1000)), list(x, y)]
step 3: create n raster layers
layer_1 <- rasterFromXYZ(prec_1_sum [,1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_2 <- rasterFromXYZ(prec_2_sum [,1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_3 <- rasterFromXYZ(prec_3_sum [,1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
...
layer_60 <- rasterFromXYZ(prec_60_sum [,1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
step 4: stack raster layers
stack_prec <- stack(layer_1, layer_2, layer_3, layer_4, layer_5, layer_6, layer_7, layer_8, layer_9, layer_10,
layer_11, layer_12, layer_13, layer_14, layer_15, layer_16, layer_17, layer_18, layer_19, layer_20,
layer_21, layer_22, layer_23, layer_24, layer_25, layer_26, layer_27, layer_28, layer_29, layer_30,
layer_31, layer_32, layer_33, layer_34, layer_35, layer_36, layer_37, layer_38, layer_39, layer_40,
layer_41, layer_42, layer_43, layer_44, layer_45, layer_46, layer_47, layer_48, layer_49, layer_50,
layer_51, layer_52, layer_53, layer_54, layer_55, layer_56, layer_57, layer_58, layer_59, layer_60)
It’s always much easier to help when we have sample data we can use. In the future you can use dput(prec) and copy and paste that output for people to use. At the very least some sample data is useful, particularly when you’re using functions that have certain specifications for what the data should look like. Here we generate some data to work with.
library(raster)
#> Loading required package: sp
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:raster':
#>
#> shift
set.seed(1)
dat <-
data.frame(
x = rep(seq(-47.8, -47.2, by = 0.2), 241),
y = -21.2,
prec = runif(964),
tstep = rep(1:241, each = 4),
date = c(rep(as.Date("2015-01-01"), 4), rep(seq(as.Date("2015-01-01"), by = "day", length.out = 60), each = 16))
)
For your process, it seems a bit more straightforward to group the data rather than break it up. That way you only have to perform the operations on one data set rather than do it many times over. Steps 1 and 2 can be reduced to only a few lines that way. Without thinking too much about optimizing this, I’ve looped over the groups created in the first step to create the raster layers.
raster_layers <- function(dat){
## some flexibility if there is a differing number of tsteps
## it will by default exclude the first tstep as in your example
min_tstep <- min(dat$tstep)
max_tstep <- max(dat$tstep)
breaks <- seq(min_tstep, max_tstep, by = 4)
## Step 1
dat$group <- cut(dat$tstep, breaks)
dat <- dat[!is.na(dat$group), ]
## Step 2
prec <- setDT(dat)[ , list(prec_sum = sum(prec * 1000)), by = list(group, x, y)]
## Step 3
layer <- list()
group <- unique(prec$group)
j <- 1
for (i in group){
raster_dat <- prec[prec$group %in% i , c("x", "y", "prec_sum")]
## looks like your plot uses the names for changing labels??
colnames(raster_dat)[colnames(raster_dat) == "prec_sum"] <- paste0("prec_sum_", j)
layer[[j]] <-
rasterFromXYZ(raster_dat,
res = c(0.20, 0.20),
crs = sp::CRS("+init=epsg:4326"))
j <- j + 1
}
## Step 4
stack_prec <- stack(unlist(layer))
return(stack_prec)
}
Example
stack_prec <- raster_layers(dat = dat)
stack_prec
#> class : RasterStack
#> dimensions : 1, 4, 4, 60 (nrow, ncol, ncell, nlayers)
#> resolution : 0.2, 0.2 (x, y)
#> extent : -47.9, -47.1, -21.3, -21.1 (xmin, xmax, ymin, ymax)
#> crs : +init=epsg:4326
#> names : prec_sum_1, prec_sum_2, prec_sum_3, prec_sum_4, prec_sum_5, prec_sum_6, prec_sum_7, prec_sum_8, prec_sum_9, prec_sum_10, prec_sum_11, prec_sum_12, prec_sum_13, prec_sum_14, prec_sum_15, ...
#> min values : 2112.4990, 1124.8232, 2007.5945, 1315.0517, 1729.9294, 1582.8684, 1524.0147, 1098.1529, 2008.5390, 1248.1860, 1680.0199, 1855.4024, 815.4047, 1204.8576, 1416.3943, ...
#> max values : 2336.186, 2565.158, 2877.219, 2318.115, 3017.609, 2540.536, 2569.019, 2690.884, 2327.706, 2288.046, 3104.792, 2639.530, 2358.953, 2599.245, 2618.676, ...
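If you prefer to avoid the loop entirely, a possible shortcut (a sketch, not part of the answer above) is to reshape the grouped sums to wide format and let rasterFromXYZ() build all layers at once, since extra value columns become extra layers:
library(data.table)
dat2 <- as.data.table(dat)
dat2[, group := cut(tstep, seq(min(tstep), max(tstep), by = 4))]
prec_sum <- dat2[!is.na(group), .(prec_sum = sum(prec * 1000)), by = .(group, x, y)]
# one column per 4-tstep group
wide <- dcast(prec_sum, x + y ~ group, value.var = "prec_sum")
stack_prec2 <- rasterFromXYZ(as.data.frame(wide), res = c(0.2, 0.2), crs = sp::CRS("+init=epsg:4326"))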

Find coordinates within radius around many starting points in grid

I have a grid of 10x10m coordinates that I extracted from a raster. I have a set of 'starting points'. For each starting point, I want to find the location (coordinates) of cells within a 10-50m radius around it.
I am aware of functions to do this with a raster starting point, but additional analyses that I have not included here require that I perform the search from a grid of coordinates in the format shown below.
The code below achieves my aim; however, the outer function produces vectors that are far too large (> 10 Gb) for my actual dataset (a grid of 9 million 10x10m cells, with 3000 starting points).
I am looking for alternatives that achieve the same result as the following (simplified) code, but do not require large vector storage or looping over each starting point separately.
library(raster)
library(tidyverse)
#Set up the mock raster
orig=raster(nrows=100, ncols=100)
res(orig)=10
vals <- rep(c(1, 2, 3, 1, 2, 3, 1, 3, 2), times = c(72, 72, 72, 72, 72, 72, 72, 72, 72))
setValues(orig, vals)
values(orig) <- vals
xygrid <- as.data.frame(orig, xy = TRUE) %>% .[,1:2]
head(xygrid)
x y
1 -175 85
2 -165 85
3 -155 85
4 -145 85
5 -135 85
6 -125 85
#the initial starting points
init_locs <- c(5, 10, 15, 20)
#calculate the distance to every surrounding cell from starting point
Rx <- outer(xygrid[init_locs, 1], xygrid[, 1], "-")
Ry <- outer(xygrid[init_locs, 2], xygrid[, 2], "-")
R <- sqrt(Rx^2+Ry^2) #overall distance
for (i in 1:length(R[,1])) {
expr2 <- (R[i,] > 10 & R[i,] <= 50) #extract the location of cells within 10-50m
inv <- xygrid[expr2,] #extract the coordinates of these cells
}
head(inv)
x y
15 -35 85
16 -25 85
17 -15 85
18 -5 85
22 35 85
23 45 85
(Raster and spatial data are not my specialty, but this made me think of a naive approach that might work acceptably. I don't know anything about the methods @Robert Hijmans mentioned; those are likely much more performant. I just thought this sounded like an interesting question to explore with basic methods.)
Approach
The main challenge here is you have 9 million cells, but only around 80 of those will be within 50m of any given point. If you calculate all those cells' distances to 3,000 starting points and then filter for those under 50m, that's 9M x 3k = 27 billion calculations, and a gigantic data structure, almost all of which is unnecessary.
We can quickly get ~1,000x more efficient by splitting this into two problems -- first, what general region of potentially-within-50m-points should we look at, and second, what is the actual distance to the points in those regions?
We can precalculate a modestly sized <2MB hash table for step 1. Then, by joining it to our locations (a very fast operation), we can focus our calculations on the 1/1000th of points that have a chance of being within 50m. I arbitrarily split the original cells into 100 x 100 = 10k sectors, each sector holding 30x30 cells.
1. Creating hash table
For the hash table, I'll assign each point to a sector, somewhat arbitrarily as 30x30 cells, so we have 100x100 = 10k sectors. This could be tuned based on speed vs. memory tradeoffs.
max_dist = 30 # sector width, in cells
xygrid2 <- expand_grid(
x = seq(0, 2999, by = 1), # 3000x3000 location grid
y = seq(0, 2999, by = 1))
xygrid2$sector_x = xygrid2$x %/% max_dist # 100 x 100 sectors
xygrid2$sector_y = xygrid2$y %/% max_dist
y_range = max(xygrid2$sector_y) + 1
xygrid2$sector_num = xygrid2$sector_x*y_range + xygrid2$sector_y
We now have 10,000 sectors assigned. Now which sectors are adjacent to which others? In every case, the adjacent sectors follow the same pattern. In this case, I have 100 sectors across x, so the sectors adjacent to sector S will have sector numbers that vary from S by -101 -100 -99 -1 0 1 99 100 101. We can use this pattern to assign all the adjacencies instantaneously. For simplicity, I leave in sectors outside our range; they will be ignored later anyway.
sector_num_deltas <- rep(-1:1, times = 3) + rep(-1:1, each = 3) * y_range
distinct(xygrid2, sector_num) %>%
uncount(9) %>% # copy each row 9 times, one for each adjacency
mutate(sector_num_adj = sector_num + sector_num_deltas) -> adjacencies
2. Join and calculate
Now that we have that, the rest goes much faster, since we can do the calculations only on the 1/1000th of sectors that are nearby. With that, we can now identify the 240,000 points that are within 50m of the 3,000 starting positions in under 4 seconds:
# Here are 3,000 random starting locations
set.seed(42)
sample_starts <- xygrid2 %>%
slice_sample(n = 3000) %>%
mutate(sample_num = row_number())
# Join each location to all the adjacent sectors, and then add all the
# locations within those sectors, and then calculate distances.
sample_starts %>% # 3,000 starting points...
# join each position to the nine adjacent sectors = ~27,000 rows
left_join(adjacencies, by = "sector_num") %>%
# join each sector to the (30x30 = 900) cells in those sectors --> 24 million rows
# That's a lot, but it's only 1/1000th of the starting problem with
# 3k x 9M = 27 billion comparisons!
left_join(xygrid2, by = c("sector_num_adj" = "sector_num")) %>%
select(-contains("sector")) %>%
mutate(dist = sqrt((x.x-x.y)^2 + (y.x-y.y)^2)) %>%
filter(dist <= 5) -> result
The result tells us that our 3,000 sample starting points are within 5 grid units (i.e. 50m) of 242,575 cells, about 80 for each starting point.
result
# A tibble: 242,575 x 6
x.x y.x sample_num x.y y.y dist
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1069 140 1 1064 140 5
2 1069 140 1 1065 137 5
3 1069 140 1 1065 138 4.47
4 1069 140 1 1065 139 4.12
5 1069 140 1 1065 140 4
6 1069 140 1 1065 141 4.12
7 1069 140 1 1065 142 4.47
8 1069 140 1 1065 143 5
9 1069 140 1 1066 136 5
10 1069 140 1 1066 137 4.24
# … with 242,565 more rows
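A quick check of the roughly-80-cells-per-start figure (a sketch, assuming result as above):
result %>%
dplyr::count(sample_num, name = "n_cells") %>%
dplyr::summarise(mean_cells = mean(n_cells))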
Here's a sample to see how that's working in a small corner of our data:
ggplot(a %>% mutate(sample_grp = sector_num_adj %% 8 %>% as.factor),
aes(x.y, y.y, color = sample_grp)) +
geom_point(data = adjacencies %>% filter(sector_num_adj == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray80", shape = 21,
aes(x, y)) +
geom_point(data = adjacencies %>% filter(sector_num == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray70", shape = 21,
aes(x, y)) +
annotate("text", alpha = 0.5,
x = c(1725, 1750),
y = c(1960, 1940),
label = c("Lookup area", "sector of\nstarting location")) +
geom_point(size = 1) +
scale_color_discrete(guide = FALSE) +
coord_equal() -> my_plot
library(gganimate)
animate(
my_plot +
gganimate::view_zoom_manual(pan_zoom = -1, ease = "quadratic-in-out",
xmin = c(0, 1700),
xmax = c(3000, 1800),
ymin = c(0, 1880),
ymax = c(3000, 1980)),
duration = 3, fps = 20, width = 300)
Example data --- you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.
library(raster)
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)
dn <- 2.5 # min dist
dx <- 5.5 # max dist
The simplest approach would be to use pointDistance
p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]
plot(r)
points(pts)
But you will probably run out of memory with that, and it is inefficient to compute all distances. Instead, you may intersect the points with a buffer around the points of interest:
b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)
plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")
With terra it goes like this -- and works well for large datasets in version 1.1-11 that should be on CRAN this week
library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)
You can do similar things with sf.
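For example, a rough sf equivalent of the buffer-and-intersect approach might look like this (a sketch, using the planar example data from above):
library(sf)
crs_str <- "+proj=utm +zone=1 +datum=WGS84"
pts_sf <- st_as_sf(as.data.frame(p), coords = c("x", "y"), crs = crs_str)
grid_sf <- st_as_sf(xygrid, coords = c("x", "y"), crs = crs_str)
# ring-shaped search area: outer buffer minus inner buffer
ring <- st_difference(st_union(st_buffer(pts_sf, dx)), st_union(st_buffer(pts_sf, dn)))
sel <- grid_sf[ring, ] # grid points that fall inside the ring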
Given that you have so many data points, you might want to start with removing all points that are clearly not of interest
xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})
xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2
And now you could run pointDistance as above on this reduced set of points (or do the distance test inside the lapply function).
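For instance, a sketch that does the selection and the distance test in one pass per starting point (computing the planar distance directly, which is what pointDistance with lonlat=FALSE does):
near <- lapply(locs, function(i) {
xy <- xygrid[i, ]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
cand <- xygrid[s, ]
d <- sqrt((cand[,1] - xy[,1])^2 + (cand[,2] - xy[,2])^2)
cand[d > dn & d < dx, ]
})
near <- do.call(rbind, near)
dim(near)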
You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times that is wrong. Maybe it is true in your case. For others who stumble upon this question, here are two raster-based approaches.
With the raster package you could use extract(..., cellnumbers=TRUE) or adjacent. With adjacent, you would first make a weights matrix using one of the buffers made above:
buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0
And then use the weight matrix like this
a <- adjacent(r, locs, w, pairs=FALSE)
pts <- xyFromCell(r, a)
plot(r)
points(pts)
With terra you could use the cells method
d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)

Sum pixel values in a Raster stack based on another raster stack

I have a raster stack representing evapotranspiration (ET) with 396 layers (3 raster layers per month for 11 years, 2009 to 2019). For each month the raster layers always represent the 1st, 11th and 21st day of the month, called dekads. Here is a sample dataset:
library(raster)
#create a raster with random numbers
r <- raster(ncol=5, nrow=5, xmx=-80, xmn=-150, ymn=20, ymx=60)
values(r) <- runif(ncell(r))
#create a random raster stack for 3 raster a month for 11 years
n <- 396 #number of raster
s <- stack(replicate(n, r)) # convert to raster stack
#rename raster layers to reflect date
d = rep(c(1, 11, 21), 132)
m = rep(1:12, 11, each = 3)
y = rep(2009:2019, each = 36)
df.date <- as.Date(paste(y, m, d,sep="-"), "%Y-%m-%d")
names(s) = df.date
I also have two other raster stacks, with pixel values representing season start (ss, 11 layers) and season end (se, 11 layers) for the years 2009 to 2019.
#create a raster stack representing season start (ss) and season end (se)
# The pixel value represents dekad number. Each raster layer covers exactly three calendar years with the target year in the middle.
# (1-36 for the first year, 37-72 for the target year, 73-108 for the next year).
ss.1 = r # season start raster
values(ss.1)= as.integer(runif(ncell(ss.1), min=1, max=72))
se.1 = ss.1+10 # season end raster
yr = 11
ss <- stack(replicate(yr, ss.1)) # season start raster stack
se <- stack(replicate(yr, se.1)) #season end rasterstack
Now I need to estimate the seasonal sum for each year from the "s" raster stack, such that the time period to sum for each pixel corresponds to the pixel values in "ss" and "se", using a 3-year moving window.
Here is an example of the output I need for one time step (3-year window), with one season start (ss) raster and one season end (se) raster. But I am really stuck at looping through the three raster stacks (s, representing the data; ss, the season start; and se, the season end).
Grateful for any help.
# Example to calculate pixel based sum for 1 time step
#subset first 3 years - equal to 108 dekads
s.sub = subset(s, 1:108)
# sum each grid cells of "s" raster stack using "ss.1" and "se.1" as an indicator for the three year subset.
for (i in 1:ncell(s.sub)) {
x[i] <- sum(s[[ss.1[i]:se.1[i]]][i], na.rm = T)
}
The last part of your example did not work and I changed it to this (for one year)
x <- rep(NA, ncell(s))
for (i in 1:ncell(s)) {
x[i] <- sum(s[i][ss.1[i]:se.1[i]], na.rm = T)
}
x <- setValues(ss.1, x)
x
#class : RasterLayer
#dimensions : 5, 5, 25 (nrow, ncol, ncell)
#resolution : 14, 8 (x, y)
#extent : -150, -80, 20, 60 (xmin, xmax, ymin, ymax)
#crs : +proj=longlat +datum=WGS84 +no_defs
#source : memory
#names : layer
#values : 0.6505058, 10.69957 (min, max)
You can get that result like this (in thefun, the +2 offset accounts for the two index layers that cbind(y, x) places before the values of s):
idx <- stack(ss.1, se.1)
thefun <- function(x, y){
apply(cbind(y, x), 1, function(i) sum(i[(i[1]:i[2])+2], na.rm = T))
}
z <- overlay(s, idx, fun=thefun)
There are more examples here for a similar question.
Given that this is a general problem, I have added a function rapp (range-apply) for it in terra (the replacement for raster) --- available here; this should be on CRAN in early July.
library(terra)
r <- rast(ncols=5, nrows=5, xmin=-150, xmax=-80, ymin=20, ymax=60)
values(r) <- 1:ncell(r)
s <- rast(replicate(108, r)) # 108 dekad layers = 3 years, covering the 1-108 index range used by ss.1/se.1
ss.1 <- r
values(ss.1) <- as.integer(runif(ncell(ss.1), min=1, max=72))
se.1 <- ss.1+10
x <- rapp(s, ss.1, se.1, sum)