Larger-than-memory operations: Spatial joins with R

I have 300 million points I want to intersect with 60 million polygons. The combination of the two is larger than I can easily fit into memory on my machine. I have spiked out a solution where I load each dataset into PostGIS, build a spatial index on each, then perform the spatial join.
In PostGIS that looks like:
SELECT pts.*, grid.gridID
INTO test_join
FROM pts, grid
WHERE ST_Contains(grid.geometry, pts.geometry);
Building the spatial index on pts (300 million points) takes about 90 minutes; the join above then takes ~190 minutes.
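For reference, this is roughly how the same PostGIS workflow can be driven from R via DBI/RPostgres; the connection details below are placeholders and the table/column names simply mirror the query above:
# Sketch only: assumes pts and grid are already loaded in a PostGIS database
# reachable with these placeholder credentials.
library(DBI)
library(RPostgres)
con <- dbConnect(Postgres(), dbname = "gis", host = "localhost",
                 user = "user", password = "password")
# Build the spatial indexes (this is the ~90 minute step for 300M points)
dbExecute(con, "CREATE INDEX IF NOT EXISTS pts_geom_idx ON pts USING GIST (geometry);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS grid_geom_idx ON grid USING GIST (geometry);")
# Run the join server-side and materialise the result (the ~190 minute step)
dbExecute(con, "
  SELECT pts.*, grid.gridID
  INTO test_join
  FROM pts, grid
  WHERE ST_Contains(grid.geometry, pts.geometry);")
dbDisconnect(con)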
I have never dealt with larger than RAM spatial data with R previously.
Are there ways of dealing with this scale of data using the sf package in R?
What strategies exist for speeding up this operation?
Should I be considering other tools or approaches?
My preference is to stay with open source tools (R, PostGIS, Python, etc.), but I am not committed to any particular tool chain.
Additional Data
It seems that my not illustrating a specific solution has caused confusion. The reason I had not initially provided any syntax or examples is that I am not wedded to a specific platform; I'm open to ideas using any open source stack. As the title says, and as I reiterate in the text, the issue here is scale, not the syntax for solving a trivial example.
Here is a very specific solution using the sf package in R. The example below uses a US grid of 500 km squares and 1000 random points. I'd like to scale this to sub-1 km grids and 300,000,000 points. I don't care about plotting at all, but I plot a few things below for illustration only.
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
library(tidyverse)
library(spData)
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source'))`
# size of squares in projection units (in this case meters)
grid_size <- 500000
num_pts <- 1000 # number of points to join
data(us_states) # loads the us_states shape
all_states <-
  us_states %>%
  # st_sf() %>%
  st_transform(102003) %>% # project to a meters-based projection
  st_combine() %>%         # flattens the shape file to one big outline (no states)
  st_buffer(10000)         # add a 10 km buffer
# a nice outer buffer of the USA lower 48
ggplot() +
  geom_sf(data = all_states)
## let's put a grid over the whole US
state_box <- st_bbox(all_states)
xrange <- state_box$xmax - state_box$xmin
yrange <- state_box$ymax - state_box$ymin
cell_dim <-
  c(ceiling(xrange / grid_size),
    ceiling(yrange / grid_size)) # dimension of polygons necessary
full_us_grid <-
  st_make_grid(all_states, square = TRUE, n = cell_dim) %>%
  st_intersection(all_states) %>% # only the inside part
  st_sf() %>%
  mutate(grid_id = 1:n())
ggplot() +
  geom_sf(data = full_us_grid)
## now let's create some random points
random_pts <- data.frame(
  point_id = 1:num_pts,
  lat = runif(num_pts, 30, 50),
  lon = runif(num_pts, -117, -78)
) %>%
  # these are in degrees, so use a matching lon/lat crs
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(102003) # transform into our metric crs
ggplot() +
  geom_sf(data = full_us_grid) +
  geom_sf(data = random_pts)
## here is the spatial join!!
joined_data <-
  full_us_grid %>%
  st_join(random_pts)
## this is the mapping from grid_id to point_id
joined_data %>%
  st_set_geometry(NULL) %>%
  na.omit() %>%
  head()
#> grid_id point_id
#> 7 7 26
#> 7.1 7 322
#> 7.2 7 516
#> 7.3 7 561
#> 7.4 7 641
#> 7.5 7 680
Created on 2018-12-24 by the reprex package (v0.2.1)

In this particular case (finding which points lie within rectangular cells),
you can get both a speed boost and a reduction in memory requirements by
building a QuadTree with the createTree function of the SearchTrees package and
then looking for points-in-cell with its rectLookup function.
This way you both save memory (there is no need to build a polygon grid) and
gain speed: once the QuadTree is built, rectLookup is very fast because it
considerably reduces the number of coordinate comparisons to be done.
For example:
library(sf)
library(spData)
library(SearchTrees)
library(data.table)
library(ggplot2)
data(us_states) # loads the us_states shape
all_states <-
  us_states %>%
  # st_sf() %>%
  st_transform(102003) %>% # project to a meters based projection
  st_combine() %>%         # flattens the shape file to one big outline (no states)
  st_buffer(10000)         # add a 10k buffer
# define the grid - no need to create a polygon grid, which is memory intensive
# for small grid sizes. Just get the bbox, compute the number of cells and
# assign a unique index to each
#
grid_size <- 500000
state_box <- st_bbox(all_states)
xrange <- state_box$xmax - state_box$xmin
yrange <- state_box$ymax - state_box$ymin
cell_dim <- c(ceiling(xrange / grid_size),
              ceiling(yrange / grid_size))
n_cells <- cell_dim[1] * cell_dim[2]
ind_rows <- ceiling(1:n_cells / cell_dim[1])
ind_cols <- (1:n_cells) - (ind_rows - 1) * cell_dim[1]
cell_indexes <- data.frame(grid_id = 1:n_cells,
                           ind_row = ind_rows,
                           ind_col = ind_cols,
                           stringsAsFactors = FALSE)
## now let's create some random points. Here I build the points directly in
## the 102003 projection for speed, because st_transform() does not scale
## very well with the number of points. If your points are in 4326 you may
## consider transforming them beforehand and storing the result in an RData,
## gpkg or shapefile. I also avoid creating an `sf` object to save memory: a
## plain x-y-id data.table suffices
set.seed(1234)
t1 <- Sys.time()
num_pts <- 3000
random_pts <- data.table::data.table(
  point_id = 1:num_pts,
  lon = runif(num_pts, state_box$xmin, state_box$xmax),
  lat = runif(num_pts, state_box$ymin, state_box$ymax)
)
# Build a Quadtree over the points.
qtree <- SearchTrees::createTree(random_pts, columns = c(2,3))
# Define a function which uses `SearchTrees::rectLookup` to find points within
# a given grid cell. Also deal with "corner cases": cells outside all_states and
# cells only partially within all_states.
find_points <- function(cell, qtree, random_pts, state_box, all_states,
                        grid_size, cell_indexes) {
  cur_xmin <- state_box[["xmin"]] + grid_size * (cell_indexes$ind_col[cell] - 1)
  cur_xmax <- state_box[["xmin"]] + grid_size * (cell_indexes$ind_col[cell])
  cur_ymin <- state_box[["ymin"]] + grid_size * (cell_indexes$ind_row[cell] - 1)
  cur_ymax <- state_box[["ymin"]] + grid_size * (cell_indexes$ind_row[cell])
  cur_bbox <- sf::st_bbox(c(xmin = cur_xmin, xmax = cur_xmax,
                            ymin = cur_ymin, ymax = cur_ymax),
                          crs = sf::st_crs(all_states)) %>%
    sf::st_as_sfc()
  # look for contained points only if the cell intersects with the all_states poly
  if (lengths(sf::st_intersects(cur_bbox, all_states)) != 0) {
    if (lengths(sf::st_contains(all_states, cur_bbox)) != 0) {
      # If cell completely contained, use `rectLookup` to find contained points
      pts <- SearchTrees::rectLookup(
        qtree,
        xlims = c(cur_xmin, cur_xmax),
        ylims = c(cur_ymin, cur_ymax))
    } else {
      # If cell intersects, but is not completely contained (i.e., on borders),
      # limit the rectLookup to the bbox of intersection to speed-up, then find
      # points properly contained
      cur_bbox <- sf::st_bbox(sf::st_intersection(all_states, cur_bbox))
      pts <- SearchTrees::rectLookup(
        qtree,
        xlims = c(cur_bbox[["xmin"]], cur_bbox[["xmax"]]),
        ylims = c(cur_bbox[["ymin"]], cur_bbox[["ymax"]]))
      # now we should have "few" points - we can use sf operators - here st_contains
      # is much faster than an intersect. This should be fast even over large
      # number of points if the cells are small
      contained_pts <- sf::st_contains(
        all_states,
        sf::st_as_sf(random_pts[pts, ],
                     coords = c("lon", "lat"),
                     crs = sf::st_crs(all_states)))[[1]]
      pts <- random_pts[pts[contained_pts], ][["point_id"]]
    }
    if (length(pts) == 0) {
      pts <- as.numeric(NA)
    } else {
      pts <- random_pts$point_id[pts]
    }
  } else {
    pts <- as.numeric(NA)
  }
  out <- data.table::data.table(
    grid_id = cell_indexes$grid_id[cell],
    point_id = pts)
  return(out)
}
Let's see if it works:
# Run the function through a `lapply` over grid cells
out <- lapply(1:n_cells, FUN = function(x)
  find_points(x, qtree, random_pts, state_box, all_states, grid_size, cell_indexes))
out <- data.table::rbindlist(out)
out
#> grid_id point_id
#> 1: 1 NA
#> 2: 2 NA
#> 3: 3 NA
#> 4: 4 325
#> 5: 4 1715
#> ---
#> 1841: 59 1058
#> 1842: 60 899
#> 1843: 60 2044
#> 1844: 60 556
#> 1845: 60 2420
grd <- sf::st_make_grid(all_states, cellsize = 500000) %>%
  sf::st_sf() %>%
  dplyr::mutate(grid_id = 1:60)
id_sub = c(5, 23)
sub_pts <- out[grid_id %in% id_sub]
sub_pts <- dplyr::left_join(sub_pts, random_pts) %>%
  sf::st_as_sf(coords = c("lon", "lat"), crs = st_crs(all_states))
#> Joining, by = "point_id"
ggplot2::ggplot(data = grd) +
  geom_sf(data = grd, fill = "transparent") +
  geom_sf_text(aes(label = grid_id)) +
  geom_sf(data = all_states, fill = "transparent") +
  geom_sf(data = sub_pts)
In my (limited) experience, this should scale pretty well with the number of
points / cells and has a reasonably low memory footprint. In addition, it is
easily parallelizable (provided you have enough memory); a rough sketch follows below.
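For example, something along these lines with the parallel package (my addition, not benchmarked; mclapply forks the R session, so it is Unix-only and each worker needs its own share of memory):
# Same lapply as above, spread over 4 cores; on Windows use parallel::parLapply
library(parallel)
out <- mclapply(1:n_cells, mc.cores = 4, FUN = function(x)
  find_points(x, qtree, random_pts, state_box, all_states, grid_size, cell_indexes))
out <- data.table::rbindlist(out)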
If you still do not manage to run it on the full dataset (I could not test
it on my laptop), you could also "split" the execution by analyzing the points
in "chunks": for example, by saving them to a shp/gpkg and then reading only
part of the points using the query argument, or by saving a table ordered by
lon and reading the first XX rows at a time. Filtering on longitude/latitude
could give you a further speed-up, because it would also automatically reduce
the number of cells to be analyzed and save a lot of time.
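A rough sketch of that chunked variant (my addition, untested; it assumes the points are written once to a GeoPackage layer called "pts" with lon/lat kept as plain attribute columns, and the band limits below are placeholders):
# Write the points once, keeping lon/lat as attribute columns (remove = FALSE)
pts_sf <- sf::st_as_sf(random_pts, coords = c("lon", "lat"),
                       crs = sf::st_crs(all_states), remove = FALSE)
sf::st_write(pts_sf, "points.gpkg", layer = "pts", delete_layer = TRUE)
# Read back one longitude band at a time with the OGR SQL `query` argument,
# then run the cell loop on that band only (and only on the cells it overlaps)
band <- sf::st_read(
  "points.gpkg",
  query = "SELECT * FROM pts WHERE lon >= -2000000 AND lon < -1000000",
  quiet = TRUE)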

Try using the cloud solution as described in the link below:
https://blog.sicara.com/speedup-r-rstudio-parallel-cloud-performance-aws-96d25c1b13e2

Related

sf: Generate random points with maximal distance condition

I'd like to generate 100 random points while enforcing a minimum distance between points: using st_buffer() with a size of 1000 meters around each point, I eliminate any offending points. But in my example:
library(sf)
# Data set creation
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m")
st_bbox(df.laea)
#
# Random simulation of 100 points inside the df.laea extent
sim_study_area <- st_sample(st_as_sfc(st_bbox(df.laea)), 100) %>% # random points, as a list ...
  st_sf()
border_area <- st_as_sfc(st_bbox(df.laea)) %>% # the bounding box, as a polygon
  st_sf()
# I'd like to imposed a maximal distance of 1000 meters around points and for this:
i <- 1 # iterator start
buffer_size <- 1000 # minimal distance to be enforced (in meters)
repeat({
  # create buffer around i-th point
  buffer <- st_buffer(sim_study_area[i, ], buffer_size)
  offending <- sim_study_area %>% # start with the intersection of master points...
    st_intersects(buffer, sparse = F) # ... and the buffer, as a vector
  # i-th point is not really offending
  offending[i] <- TRUE
  # if there are any offending points left - re-assign the master points
  sim_study_area <- sim_study_area[offending, ]
  if (i >= nrow(sim_study_area)) {
    # the end was reached; no more points to process
    break
  } else {
    # rinse & repeat
    i <- i + 1
  }
})
# Visualization of the points created with the offending condition:
simulation_area <- ggplot() +
  geom_sf(data = border_area, col = 'gray40', fill = NA, lwd = 1) +
  geom_sf(data = sim_study_area, pch = 3, col = 'red', alpha = 0.67) +
  theme_bw()
plot(simulation_area)
The result is not OK because I don't end up with 100 points, and I don't know how to fix it.
Any ideas?
Thanks in advance,
Alexandre
I think that the easiest solution is to adopt one of the sampling functions defined in the R package spatstat. For example:
# packages
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
# create data
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(
df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m"
)
Now we sample with a Simple Sequential Inhibition Process. Check ?spatstat.core::rSSI for more details.
sampled_points <- st_sample(
  x = st_as_sfc(st_bbox(df.laea)),
  type = "SSI",
  r = 1000, # threshold distance (in metres)
  n = 100   # number of points
)
# Check result
par(mar = rep(0, 4))
plot(st_as_sfc(st_bbox(df.laea)), reset = FALSE)
plot(sampled_points, add = TRUE, pch = 16)
# Estimate all distances
all_distances <- st_distance(sampled_points)
all_distances[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00 57735.67 183205.74 189381.50 81079.79
#> [2,] 57735.67 0.00 153892.93 143755.73 61475.85
#> [3,] 183205.74 153892.93 0.00 62696.68 213379.39
#> [4,] 189381.50 143755.73 62696.68 0.00 194237.12
#> [5,] 81079.79 61475.85 213379.39 194237.12 0.00
# Check they are all greater than 1000
sum(all_distances < 1000)
#> [1] 100 # since the diagonal is full of 100 zeros
Created on 2021-08-12 by the reprex package (v2.0.0)
Check here (in particular the answer from Prof. Baddeley), the references therein, and the help page of st_sample for more details.

Rasterizing polygons with complicated weighting

Imagine a regular 0.5° grid across the Earth's surface. A 3x3 subset of this grid is shown below. As a stylized example of what I'm working with, let's say I have three polygons—yellow, orange, and blue—that for the sake of simplicity all are 1 unit in area. These polygons have attributes Population and Value, which you can see in the legend:
I want to turn these polygons into a 0.5° raster (with global extent) whose values are based on the weighted-mean Value of the polygons. The tricky part is that I want to weight the polygons' values not by their total Population, but by their included population.
I know—theoretically—what I want to do, and below have done it for the center gridcell.
1. Multiply Population by Frac. included (the fraction of the polygon's area that falls inside the gridcell) to get Pop. included. (This assumes population is distributed evenly throughout the polygon, which is acceptable.)
2. Divide each polygon's Pop. included by the sum of all polygons' Pop. included (32) to get Weight.
3. Multiply each polygon's Value by Weight to get Result.
4. Sum all polygons' Results to get the value for the center gridcell (0.31). (A short R check of this arithmetic follows the table below.)
         Population   Value   Frac. included   Pop. included   Weight   Result
Yellow   24           0.8     0.25             6               0.1875   0.15
Orange   16           0.4     0.5              8               0.25     0.10
Blue     18           0.1     1                18              0.5625   0.06
Sum                                            32                       0.31
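A quick R check of the arithmetic above for the center gridcell (numbers copied straight from the table):
pop   <- c(yellow = 24, orange = 16, blue = 18)
value <- c(yellow = 0.8, orange = 0.4, blue = 0.1)
frac  <- c(yellow = 0.25, orange = 0.5, blue = 1)
pop_included <- pop * frac                   # 6, 8, 18
weight <- pop_included / sum(pop_included)   # 0.1875, 0.25, 0.5625
sum(value * weight)                          # 0.30625, i.e. ~0.31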
I have an idea of how to accomplish this in R, as described below. Where possible, I've filled in code that I think will do what I want. My questions: How do I do steps 2 and 3? Or is there a simpler way to do this? If you want to play around with this, I have uploaded old_polygons as a .rds file here.
library("sf")
library("raster")
1. Calculate the area of each polygon: old_polygons$old_area <- as.numeric(st_area(old_polygons))
2. Generate the global 0.5° grid as some kind of Spatial object.
3. Split the polygons by the grid, generating new_polygons.
4. Calculate the area of the new polygons: new_polygons$new_area <- as.numeric(st_area(new_polygons))
5. Calculate the fraction included for each new polygon: new_polygons$frac_included <- new_polygons$new_area / new_polygons$old_area
6. Calculate the "included population" in the new polygons: new_polygons$pop_included <- new_polygons$pop * new_polygons$frac_included
7. Calculate a new attribute for each polygon that is just its Value times its included population: new_polygons$tmp <- new_polygons$Value * new_polygons$pop_included
8. Set up an empty raster for the next steps: empty_raster <- raster(nrows=360, ncols=720, xmn=-180, xmx=180, ymn=-90, ymx=90)
9. Rasterize the polygons by summing this new attribute within each gridcell: tmp_raster <- rasterize(new_polygons, empty_raster, "tmp", fun = "sum")
10. Create another raster that is just the total included population in each gridcell: pop_raster <- rasterize(new_polygons, empty_raster, "pop_included", fun = "sum")
11. Divide the first raster by the second to get what I want:
output_raster <- empty_raster
values(output_raster) <- getValues(tmp_raster) / getValues(pop_raster)
Any help would be much appreciated!
Example data:
library(terra)
f <- system.file("ex/lux.shp", package="terra")
v <- vect(f)
values(v) <- data.frame(population=1:12, value=round(c(2:13)/14, 2))
r <- rast(ext(v)+.05, ncols=4, nrows=6, names="cell")
Illustrate the data
p <- as.polygons(r)
plot(p, lwd=2, col="gray", border="light gray")
lines(v, col=rainbow(12), lwd=2)
txt <- paste0(v$value, " (", v$population, ")")
text(v, txt, cex=.8, halo=TRUE)
Solution:
# area of the polygons
v$area1 <- expanse(v)
# intersect with raster cell boundaries
values(r) <- 1:ncell(r)
p <- as.polygons(r)
pv <- intersect(p, v)
# area of the polygon parts
pv$area2 <- expanse(pv)
pv$frac <- pv$area2 / pv$area1
Now we just use the data.frame with the attributes of the polygons to compute the polygon-cover-weighted-population-weighted values.
z <- values(pv)
a <- aggregate(z[, "frac", drop=FALSE], z[,"cell",drop=FALSE], sum)
names(a)[2] <- 'fsum'
z <- merge(z, a)
z$weight <- z$population * z$frac / z$fsum
z$wvalue <- z$value * z$weight
b <- aggregate(z[, c("wvalue", "weight")], z[, "cell", drop=FALSE], sum)
b$bingo <- b$wvalue / b$weight
Assign values back to raster cells
x <- rast(r)
x[b$cell] <- b$bingo
Inspect results
plot(x)
lines(v)
text(x, digits=2, halo=TRUE, cex=.9)
text(v, "value", cex=.8, col="red", halo=TRUE)
This may not scale very well to large data sets, but you could perhaps do it in chunks.
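A rough sketch of what that chunking could look like (my addition, not part of the answer above and untested at scale): intersect the polygons with the cell polygons in blocks and stack the attribute tables before running the aggregate() steps.
# Process the polygons in blocks of 10,000 and bind the attribute tables;
# the aggregate()/merge() steps above then run on `z` exactly as before
chunk_size <- 10000
idx <- split(seq_len(nrow(v)), ceiling(seq_len(nrow(v)) / chunk_size))
z <- do.call(rbind, lapply(idx, function(i) {
  pvi <- intersect(p, v[i, ])
  pvi$area2 <- expanse(pvi)
  pvi$frac <- pvi$area2 / pvi$area1
  values(pvi)
}))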
This is fast and scalable:
library(data.table)
library(terra)
# make the 3 polygons with radius = 5km
center_points <- data.frame(lon = c(0.5, 0.65, 1),
lat = c(0.75, 0.65, 1),
Population = c(16, 18, 24),
Value = c(0.4, 0.1, 0.8))
polygon <- vect(center_points, crs = "EPSG:4326")
polygon <- buffer(polygon, 5000)
# make the raster
my_raster <- rast(nrow = 3, ncol = 3, xmin = 0, xmax = 1.5, ymin = 0, ymax = 1.5, crs = "EPSG:4326")
my_raster[] <- 0 # set the value to 0 for now
# find the fractions of cells in each polygon
# "cells" gives you the cell ID and "weights" (or "exact") gives you the cell fraction in the polygon
# using "exact" instead of "weights" is more accurate
my_Table <- extract(my_raster, polygon, cells = TRUE, weights = TRUE)
setDT(my_Table) # convert to datatable
# merge the polygon attributes to "my_Table"
poly_Table <- setDT(as.data.frame(polygon))
poly_Table[, ID := 1:nrow(poly_Table)] # add the IDs which are the row numbers
merged_Table <- merge(my_Table, poly_Table, by = "ID")
# find Frac_included
merged_Table[, Frac_included := weight / sum(weight), by = ID]
# find Pop_included
merged_Table[, Pop_included := Frac_included * Population]
# find Weight, to avoid confusion with "weight" produced above, I call this "my_Weight"
merged_Table[, my_Weight := Pop_included / sum(Pop_included), by = cell]
# final results
Result <- merged_Table[, .(Result = sum(Value * my_Weight)), by = cell]
# add the values to the raster
my_raster[Result$cell] <- Result$Result
plot(my_raster)
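If you want the exact cell fractions mentioned in the comments above, the change is small; I believe terra then names the fraction column "fraction" rather than "weight", but please verify on your version:
# Variant with exact fractions (assumption: the column is called "fraction")
my_Table <- extract(my_raster, polygon, cells = TRUE, exact = TRUE)
setDT(my_Table)
merged_Table <- merge(my_Table, poly_Table, by = "ID")
merged_Table[, Frac_included := fraction / sum(fraction), by = ID]
# the remaining steps are unchanged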

Quickly filter down a grid of sf points according to a polygon

I want to make grids (in the sense of data frames of x- and y-coordinates) over the US, or regions of the US, throwing out any points in the original grid that are beyond the borders of the US. I have some code that seems to work, but it's quite slow when I shrink the grid increment down to 1 km (1e3) or so. How can this be done faster? Perhaps there's a way to build the simple feature collection that I need without a lapply or loop, or perhaps this can be done with a MULTIPOINT instead of a simple feature collection of POINTs.
library(sf)
crs.us.atlas = 2163 # https://epsg.io/2163
us = read_sf(paste0("/vsizip/",
"/tmp/us.zip")) # from: https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_us_state_500k.zip
# Filter down to the lower 48 + DC.
us = us[us$STUSPS %in% setdiff(c(state.abb, "DC"), c("AK", "HI")),]
us = st_transform(us, crs = crs.us.atlas)
l = as.list(st_bbox(us))
increment = 1e5
g = expand.grid(
  x = seq(l$xmin, l$xmax, by = increment),
  y = seq(l$ymin, l$ymax, by = increment))
message("Running the slow part")
print(system.time(g <- g[0 < sapply(FUN = length, st_intersects(
  st_as_sfc(crs = crs.us.atlas, lapply(1 : nrow(g), function(i)
    st_point(as.numeric(g[i, c("x", "y")])))),
  us)),]))
print(nrow(g))
I would solve the problem as follows. First, load the packages; tmap is used just for the map, so you can easily ignore it.
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(sfheaders)
library(tmap)
Download and read-in data
temp_zip <- tempfile(fileext = ".zip")
download.file(
"https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_us_state_500k.zip",
destfile = temp_zip
)
unzip(temp_zip, exdir = tempdir())
us <- st_read(file.path(tempdir(), "cb_2019_us_state_500k.shp"))
#> Reading layer `cb_2019_us_state_500k' from data source `C:\Users\Utente\AppData\Local\Temp\RtmpCakscO\cb_2019_us_state_500k.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 56 features and 9 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
#> geographic CRS: NAD83
Filter down to the lower 48 + DC.
us = us[us$STUSPS %in% setdiff(c(state.abb, "DC"), c("AK", "HI")),]
us = st_transform(us, crs = 2163)
Define the increment, the grid, and the sfc object. The key part is to create an sfc_point object for the subsequent operations.
increment = 1e5
g = expand.grid(
x = seq(st_bbox(us)$xmin, st_bbox(us)$xmax, by = increment),
y = seq(st_bbox(us)$ymin, st_bbox(us)$ymax, by = increment)
)
g_sfc <- sfc_point(as.matrix(g)) %>%
st_set_crs(2163)
Find the ID(s) of points included in the US
my_ids <- unlist(st_contains(us, g_sfc))
Visualise result
tm_shape(g_sfc) +
tm_dots(col = "grey", size = 0.05) +
tm_shape(g_sfc[my_ids]) +
tm_dots(col = "darkred", size = 0.05) +
tm_shape(us) +
tm_borders(lwd = 1.3)
Repeat for 1e3 (but I won't add any plot since that's almost 13 million points)
increment = 1e3
g = expand.grid(
x = seq(st_bbox(us)$xmin, st_bbox(us)$xmax, by = increment),
y = seq(st_bbox(us)$ymin, st_bbox(us)$ymax, by = increment)
)
It takes approximately 20 seconds to generate the data
system.time({
g_sfc <- sfc_point(as.matrix(g)) %>%
st_set_crs(2163)
})
#> user system elapsed
#> 16.29 0.92 17.27
and 80 seconds to find the IDs of the points within US.
system.time({
my_ids <- unlist(st_contains(us, g_sfc))
})
#> user system elapsed
#> 67.75 8.32 80.86
Created on 2021-01-13 by the reprex package (v0.3.0)
If you need something even more efficient, I suggest the polyclip package.
Alternatively, you could use the fasterize package to create a raster grid over the shapefile at your desired resolution and then extract the grid coordinates using raster::rasterToPoints.
This runs almost immediately for gathering the xy locations; converting back to an sf object then takes about 10 seconds or so.
library(sf)
library(fasterize)
library(raster)
crs.us.atlas = 2163
# https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_us_state_500k.zip
us = read_sf(list.files(pattern='.shp$'))
us = us[us$STUSPS %in% setdiff(c(state.abb, "DC"), c("AK", "HI")),]
us = st_transform(us, crs = crs.us.atlas)
increment = 1e3
# create the empty raster to use as the fasterize base
usrast = raster::raster(us, resolution = rep(increment, 2))
# rasterize the shapefile
r = fasterize::fasterize(us, usrast)
# extract raster cell coordinates
coords = rasterToPoints(r)
# convert coordinates to sf
shp = coords %>%
  as.data.frame() %>%
  st_as_sf(crs = crs.us.atlas, coords = c('x', 'y'))
sf has a function st_make_grid that could help with this (in older versions of sf, it even automatically cropped to the polygon), but curiously, it's quite slow as of this writing.
I can get reasonable performance without additional dependencies by using st_as_sf to convert g to a simple feature collection of points:
g = st_as_sf(crs = crs.us.atlas, coords = c(1, 2), g)
Then, following @agila's use of unlist, the slow part becomes
print(system.time(g <- g[unlist(st_intersects(us, g)),]))
which, with increment = 1e3, takes 3 minutes on a high-end server.
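(For completeness, and not something I have benchmarked: recent sf versions also provide st_filter, which wraps the same predicate-based subsetting in one call.)
# Untested alternative; the default predicate is st_intersects
g <- st_filter(g, us)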

How to create grid for coordinates for Prediction in R

I'm using a Gaussian Process model for prediction, and I'm now at the point where I need a grid file based on the coordinates in my data, but I don't have one and I don't know how to create it.
I followed the post at this link, but it shows the grid over Pennsylvania, not Chicago, where my data coordinates are located!
So I'm confused about the best way to create the grid file, including the other columns in the data.
station <- data.frame(lat = c(41.997946, 41.960669, 41.960669, 41.960669,41.909269,41.931841,41.909269,41.910561,41.866129,41.866129), long = c(-87.654561, -87.747456, -87.67459, -87.646438,-87.747456,-87.67459,-87.67459,-87.619112,-87.747456,-87.691617),station = 1:10)
station
lat long station
1 41.99795 -87.65456 1
2 41.96067 -87.74746 2
3 41.96067 -87.67459 3
4 41.96067 -87.64644 4
5 41.90927 -87.74746 5
6 41.93184 -87.67459 6
7 41.90927 -87.67459 7
8 41.91056 -87.61911 8
9 41.86613 -87.74746 9
10 41.86613 -87.69162 10
The data include more columns, such as Hour, Day, Month, Year, and Speed, and these observations are for the 10 locations over a 2-month period, but I only put the coordinates here to give an idea of how to create the grid.
Here are my steps for creating the grid, following the link above:
# promote the data.frame to a SpatialPointsDataFrame first
coordinates(station) <- ~ long + lat
# Set the projection. They were latitude and longitude, so use WGS84 long-lat projection
proj4string(station) <- CRS("+init=epsg:4326")
# View the station location using the mapview function
mapview(station)
#3. Determine the origin
# Set the origin
ori <- SpatialPoints(cbind(-87.67459, 41.99795), proj4string = CRS("+init=epsg:4326"))
# Convert the projection of ori
# Use EPSG: 3857 (Spherical Mercator)
ori_t <- spTransform(ori, CRSobj = CRS("+init=epsg:3857"))
coordinates(ori_t)
# The origin has been rounded to the nearest 100
x_ori <- round(coordinates(ori_t)[1, 1]/100) * 100
y_ori <- round(coordinates(ori_t)[1, 2]/100) * 100
# Define how many cells for x and y axis
x_cell <- 250
y_cell <- 200
# Define the resolution to be 1000 meters
cell_size <- 1000
# Create the extent
ext <- extent(x_ori, x_ori + (x_cell * cell_size), y_ori, y_ori + (y_cell * cell_size))
# Initialize a raster layer
ras <- raster(ext)
# Set the resolution
res(ras) <- c(cell_size, cell_size)
ras[] <- 0
# Project the raster
projection(ras) <- CRS("+init=epsg:3857")
# Create interactive map
mapview(station) + mapview(ras)
But when I view the map, the grid is located in the Pennsylvania area, not Chicago! Do you have an idea why? I picked lat 41.99795 and long -87.67459, and when I put them into Google Maps they show the Chicago area, but R does not show the same!
# Convert to spatial pixel
st_grid <- rasterToPoints(ras, spatial = TRUE)
gridded(st_grid) <- TRUE
st_grid <- as(st_grid, "SpatialPixels")
Also, when I save the grid file, how can I include the other columns with the grid coordinates? It only shows the new long and lat columns.
write.csv(st_grid, file = "st_grid.csv")
If these are your points
library(sp)
station <- data.frame(lat = c(41.997946, 41.960669, 41.960669, 41.960669,41.909269,41.931841,41.909269,41.910561,41.866129,41.866129), long = c(-87.654561, -87.747456, -87.67459, -87.646438,-87.747456,-87.67459,-87.67459,-87.619112,-87.747456,-87.691617),station = 1:10)
coordinates(station) = ~ long+lat
proj4string(station) <- CRS("+proj=longlat +datum=WGS84")
stp <- spTransform(station, CRSobj = CRS("+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m"))
You can do
library(raster)
r <- raster(stp, res=250)
You can further manipulate the extent with extend(), or like this (expand by 10 km and then round, without changing the resolution):
rr <- setExtent(r, round(extent(r)+10000,-3), keepres=TRUE)
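If you then need the SpatialPixels grid from your original workflow, the conversion is the same as in your question (sketch):
# give the raster values so rasterToPoints() returns every cell,
# then convert to SpatialPixels (mirrors the question's own steps)
rr[] <- 0
st_grid <- rasterToPoints(rr, spatial = TRUE)
gridded(st_grid) <- TRUE
st_grid <- as(st_grid, "SpatialPixels")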
I'm not sure what happened with your code, but it seems like your origin was set incorrectly. I've updated the code above to produce a grid over Chicago. I've picked a random starting point from Google Maps and modified x_cell and y_cell to produce a reasonably sized map over the city.
library(sp)
library(rgdal)
library(raster)
library(leaflet)
library(mapview)
station <- data.frame(lat = c(41.997946, 41.960669, 41.960669, 41.960669,41.909269,41.931841,41.909269,41.910561,41.866129,41.866129),
long = c(-87.654561, -87.747456, -87.67459, -87.646438,-87.747456,-87.67459,-87.67459,-87.619112,-87.747456,-87.691617),
station = 1:10)
coordinates(station) <- ~long + lat
# Set the projection. They were latitude and longitude, so use WGS84 long-lat projection
proj4string(station) <- CRS("+init=epsg:4326")
# View the station location using the mapview function
mapview(station)
#3. Determine the origin
# Set the origin
ori <- SpatialPoints(cbind(-87.872660, 41.619136), proj4string = CRS("+init=epsg:4326"))
# Convert the projection of ori
# Use EPSG: 3857 (Spherical Mercator)
ori_t <- spTransform(ori, CRSobj = CRS("+init=epsg:3857"))
# The origin has been rounded to the nearest 100
x_ori <- round(coordinates(ori_t)[1, 1]/100) * 100
y_ori <- round(coordinates(ori_t)[1, 2]/100) * 100
# Define how many cells for x and y axis
x_cell <- 60
y_cell <- 80
# Define the resolution to be 1000 meters
cell_size <- 1000
# Create the extent
ext <- extent(x_ori, x_ori + (x_cell * cell_size), y_ori, y_ori + (y_cell * cell_size))
# Initialize a raster layer
ras <- raster(ext)
# Set the resolution
res(ras) <- c(cell_size, cell_size)
ras[] <- 0
# Project the raster
projection(ras) <- CRS("+init=epsg:3857")
# Create interactive map
mapview(station) + mapview(ras)
This is the image I get in the end:
As for your other question, I'm not sure you're supposed to combine the grid with your data. According to the tutorial linked in the question you've mentioned, krige for example uses both the data meuse and the grid meuse.grid as arguments: lzn.kriged <- krige(log(zinc) ~ 1, meuse, meuse.grid, model=lzn.fit). Check whether this is the case also for the function and the package you're using.
EDIT:
How do you pick the origin? The origin in this particular code is the bottom-left corner of the grid, so I went to Google Maps and picked a random point slightly outside the city limits (based on Google's data), i.e. a bit below and a bit to the left of them.
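If you would rather not eyeball it, one possible alternative (my sketch, not part of the original answer) is to derive the origin from the stations' own bounding box, padded outward a little:
# Compute a bottom-left origin from the projected stations themselves,
# padding the bounding box by ~10 km so the grid extends past the stations
stp_3857 <- spTransform(station, CRSobj = CRS("+init=epsg:3857"))
bb <- bbox(stp_3857)
x_ori <- round((bb["long", "min"] - 10000) / 100) * 100
y_ori <- round((bb["lat", "min"] - 10000) / 100) * 100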

circle around a geographic point with st_buffer

I would like to plot a circle 110 NM (nautical miles) around Dublin airport using sf package.
(Later on I will intersect via st_intersect that with flight position reports from ADS-B.)
I have defined a new unit for NM as follows:
library(units)
library(tidyverse)
library(sf)
NM <- make_unit("NM")
install_conversion_constant("NM", "km", 1.852)
Then defined Dublin airport coordinates:
# DUB/EIDW location, see
# https://skyvector.com/airport/EIDW/Dublin-Airport
# Coordinates:
# N53°25.28' / W6°16.20' (Degrees Decimal Minutes (DDM) format)
# (-6.27, 53.421333) (lon/lat Decimal Degrees (DD))
# Elevation: 242.0 feet (MSL)
dub_lon <- -6.27
dub_lat <- 53.421333
dub_elv <- set_units(242.0, ft)
dub <- st_point( x = c(dub_lon, dub_lat, dub_elv), dim = "XYZ")
dub <- dub %>% st_sfc(crs = 4326)
Then I defined the radius of the circle around the airport:
r110 <- set_units(110, NM) %>% set_units(km)
Now when I try st_buffer things are not working:
> dub110 <- st_buffer(dub, dist = r110)
Error: cannot convert km into °
In addition: Warning message:
In st_buffer.sfc(dub, dist = r110) :
  st_buffer does not correctly buffer longitude/latitude data, dist needs to be in decimal degrees.
If I instead pass a plain numeric value (203.72, which is the distance in km) as the distance, I at least get only a warning:
> dub110 <- st_buffer(dub, dist = 203.72)
Warning message:
In st_buffer.sfc(dub, dist = 203.72) :
st_buffer does not correctly buffer longitude/latitude data, dist needs to be in decimal degrees.
But plotting it shows far too big a circle:
library(mapview)
mapview(dub110)
What are the units for dist I should enter in st_buffer?
I read the documentation but didn't really find out what to do...
Any hints/helps really appreciated!
Thanks to Phil and Jul the complete solution to the initial question is as follows:
library(units)
library(tidyverse)
library(sf)
library(mapview)
library(units)
# define nautical miles (as per ICAO notation)
NM <- make_unit("NM")
install_conversion_constant("NM", "km", 1.852)
# DUB/EIDW location, see
# https://skyvector.com/airport/EIDW/Dublin-Airport
# Coordinates:
# N53°25.28' / W6°16.20' (Degrees Decimal Minutes (DDM) format)
# (-6.27, 53.421333) (lon/lat Decimal Degrees (DD))
# Elevation: 242.0 feet (MSL)
dub_lon <- -6.27
dub_lat <- 53.421333
dub_elv <- set_units(242.0, ft)
dub <- st_point(x = c(dub_lon, dub_lat, dub_elv), dim = "XYZ")
dub <- dub %>% st_sfc(crs = 4326)
# define radius of interest, i.e. 110 NM
r110 <- set_units(110, NM) %>% set_units(km) %>% set_units(m)
# change to Irish grid, which uses meters
dub <- st_transform(dub, 29902)
dub_buffer <- st_buffer(dub, r110)
# eventually convert back to WGS84 if needed for other purposes
dub <- st_transform(dub, 4326)
dub_buffer <- st_transform(dub_buffer, 4326)
mapview(dub_buffer)
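As a quick sanity check (my addition, not part of the original answers), the buffer's area should be close to pi * r^2 for a 110 NM radius:
# the two numbers should roughly agree; the buffer is a polygonal
# approximation of the circle, so expect a small difference
pi * r110^2          # theoretical area, in m^2
st_area(dub_buffer)  # area of the buffered polygon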
Here's a pure sf answer if you prefer, but @Jul's is perfectly serviceable.
Set up as your example:
library(units)
library(tidyverse)
library(sf)
NM <- make_unit("NM")
install_conversion_constant("NM", "km", 1.852)
# DUB/EIDW location, see
# https://skyvector.com/airport/EIDW/Dublin-Airport
# Coordinates:
# N53°25.28' / W6°16.20' (Degrees Decimal Minutes (DDM) format)
# (-6.27, 53.421333) (lon/lat Decimal Degrees (DD))
# Elevation: 242.0 feet (MSL)
dub_lon <- -6.27
dub_lat <- 53.421333
dub_elv <- set_units(242.0, ft)
dub <- st_point(x = c(dub_lon, dub_lat, dub_elv), dim = "XYZ")
dub <- dub %>% st_sfc(crs = 4326)
Then transform your coordinate to Irish Grid:
dub = st_transform(dub, 29902)
Create your buffer in metres around this point:
dub_buffer = st_buffer(dub, 110000)
Plot the result:
plot(dub_buffer)
plot(dub, add = TRUE)
As mentioned in Phil's answer, you need to transform your coordinates to a metre/distance-based projection rather than a degree-based one.
I'm not familiar with sf, but with sp...
library(sp)
dub_transformed <- spTransform(dub,CRS("+init=epsg:29902"))
...before you run the buffer command should suffice.
You may then want to convert the buffered object back to epsg:4326 for plotting/additional processing. e.g.
dub110 <- spTransform(dub110,CRS("+init=epsg:4326"))
