Can I bound an st_distance call by a polygon?

I have seen similar posts on this topic (see, for example, here and here) but not one that is specific to the sf-tidyverse ecosystem.
I have a series of lakes, a series of sample points within each lake, and a "focal point" in each lake that represents where a boat launch is.
I want to calculate the "boater's shortest travel distance" to each sample point from the boat launch. However, I want to somehow "bound" these distances using the lake polygon such that distances cannot be calculated across land. I could imagine this being done by having the "straight line" snake along the lake edge for as long as needed before it can resume being a straight line. I could also imagine the "straight line" being decomposed into line segments that bend around any intervening land. I'm open to a variety of implementations!
I have seen elsewhere (such as here) the idea of converting the bounding polygon to a raster, making the water one value and the land another, much higher value, and then doing a "least-cost distance," where the cost of going over land is prohibitive. However, I don't know how one would actually do this in the raster/sf ecosystem.
Here's the code I used to make this map:
library(dplyr)
library(stringr) #for str_detect() below
library(sf)
library(ggplot2)
Moose.ssw = sswMN.sf %>% filter(lake == "Moose")
Moose.lake = MN_lakes4 %>% filter(str_detect(map_label, "Moose")) %>% filter(cty_name == "Beltrami")
Moose.access = try3 %>% filter(LAKE_NAME == "Moose") %>% filter(COUNTYNAME == "Beltrami")
Moose.box = st_bbox(Moose.ssw)
ggplot() +
geom_sf(data = Moose.lake, color="lightblue") +
geom_sf(data = Moose.access, color = "red", size = 2) +
geom_sf(data = Moose.ssw, mapping = aes(color= Nitellopsis_obtusa_n)) +
coord_sf(xlim = c(Moose.box[1], Moose.box[3]), ylim = c(Moose.box[2], Moose.box[4]))
And here's some code that calculates straight-line distances splendidly--maybe it can be wrapped somehow?
Moose.pt.dists = st_distance(Moose.access, Moose.ssw, by_element = TRUE)
Files needed to make the data objects referenced above can be pulled from my GitHub page (they are files produced via dput()). Link to my GitHub.
I am a solid R programmer but I am new to geospatial work, so if I could even just be pointed in a fruitful direction, I may be able to program my own way out of this!

Here's a solution using sfnetworks, which fits in with the tidyverse well.
The code below should:
1. regularly sample the bounding box of the lake (creating evenly-spaced points)
2. connect each point to its closest neighbors
3. remove the connections that cross land
4. find the shortest path from the boat launch to the sample location(s) by following the lines that remain
The results are not exact, but are pretty close. You can increase precision by increasing the number of sampled points. 1000 points are used below.
library(tidyverse)
library(sf)
library(sfnetworks)
library(nngeo)
# set seed to make the script reproducible,
# since there is random sampling used
set.seed(813)
# Getting your data:
x <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
moose_lake <- x[5,]
boat_ramp <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
sample_locations <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
sample_bbox <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
# sample the bounding box with regular square points, then connect each point to the closest 9 points
# 8 should've worked, but left some diagonals out.
sq_grid_sample <- st_sample(st_as_sfc(st_bbox(moose_lake)), size = 1000, type = 'regular') %>% st_as_sf() %>%
st_connect(.,.,k = 9)
# remove connections that are not within the lake polygon
sq_grid_cropped <- sq_grid_sample[st_within(sq_grid_sample, moose_lake, sparse = F),]
# make an sfnetwork of the cropped grid
lake_network <- sq_grid_cropped %>% as_sfnetwork()
# find the (approximate) distance from boat ramp to point 170 (far north)
pt170 <- st_network_paths(lake_network,
from = boat_ramp,
to = sample_locations[170,]) %>%
pull(edge_paths) %>%
unlist()
lake_network %>%
activate(edges) %>%
slice(pt170) %>%
st_as_sf() %>%
st_combine() %>%
st_length()
#> 2186.394 [m]
Looks like it is about 2186m from the boat launch to sample location 170 at the far north end of the lake.
# Plotting all of the above, path from boat ramp to point 170:
ggplot() +
geom_sf(data = sq_grid_sample, alpha = .05) +
geom_sf(data = sq_grid_cropped, color = 'dodgerblue') +
geom_sf(data = moose_lake, color = 'blue', fill = NA) +
geom_sf(data = boat_ramp, color = 'springgreen', size = 4) +
geom_sf(data = sample_locations[170,], color = 'deeppink1', size = 4) +
geom_sf(data = lake_network %>%
activate(edges) %>%
slice(pt170) %>%
st_as_sf(),
color = 'turquoise',
size = 2) +
theme_void()
Though only the distance from the boat launch to one sample point is found and plotted above, the network is there to find all of the distances. You may need to use *apply or purrr, and maybe a small custom function to find the 'one-to-many' distances of the launch to all sample points.
This page on sfnetworks will be helpful in writing the one-to-many solution.
edit:
To find all distances from the boat launch to the sample points:
st_network_cost(lake_network,
from = boat_ramp,
to = sample_locations)
This is much faster for me than a for loop or the sp solution posted below. Some code in the sampling may need to be adjusted, since st_network_cost() will remove any duplicate distances. The sample_locations (or Moose.ssw in @bajcz's answer) will need to be cleaned to remove duplicate points as well; there are only 180 unique points among the 322 rows.
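For example, the duplicates could be dropped up front (a minimal sketch; sf provides a distinct() method for this, and the geometry column in these objects is named "geometry"):
sample_locations_unique <- sample_locations %>%
  distinct(geometry, .keep_all = TRUE)
st_network_cost(lake_network,
                from = boat_ramp,
                to = sample_locations_unique)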

I was planning to answer my own question today, as I had just managed to get the gdistance and sp code linked to in my question working, but mrhellmann beat me to it! Since their answer also works and is quite nice and elegant, I thought I'd just post code here that utilizes both approaches and compares the outcomes, for those who are interested (and in case one or the other doesn't work in your context). They are about equally fast, although that's with a for() loop in mrhellmann's sfnetworks version, so it might be faster if that bit can be vectorized, which I'm sure is possible.
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
This is "my" way using sp and gdistance tools:
ptm <- proc.time()
#Make all the objects needed into spatial class objects.
test.pts = as(Moose.access, Class = "Spatial")
ssw.pts = as(Moose.ssw, Class = "Spatial")
test.bounds = as(Moose.lake, Class = "Spatial")
#Turn the lake into a raster of resolution 1000 x 1000 (which is arbitrary) and then make all points not in the lake = 0 so that no distances can "afford" to cross over land.
test.raster = raster(extent(test.bounds), nrow=1000, ncol=1000)
lake.raster = rasterize(test.bounds, test.raster)
lake.raster[is.na(lake.raster)] = 0
##For some lakes, although not this one, the public access was just off the lake, which resulted in distances of Inf. This code puts a circular buffer around the dock locations to prevent this. It makes a new raster with 1s at the dock location(s), finds all indices of cells within the buffer distance of each dock (here, 300, which is overkill), and makes the corresponding cells in the lake raster also 1 so that they are considered navigable. This makes the distances slightly inaccurate because it may allow some points to be more easily reachable than they should be, but it is better than the alternative!
access.raster = rasterize(test.pts, lake.raster, field = 1)
index.spots = raster::extract(access.raster, test.pts, buffer=300, cellnumbers = T)
index.spots2 = unlist(lapply(index.spots, "[", , 1))
lake.raster[index.spots2] = 1
#Make a transition matrix so that the cost of moving from one cell to the next can be understood. TBH, I don't understand this part well beyond that.
transition.mat1 = transition(lake.raster, transitionFunction=mean, directions=16) #This code does take a little while.
transition.mat1 = geoCorrection(transition.mat1,type="c")
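#(For what it's worth, per the gdistance docs: transition() builds a matrix of
#"conductance" values between neighboring cells, and geoCorrection(type = "c")
#rescales those values by the true distance between cell centers, so diagonal
#steps cost proportionally more than orthogonal ones.)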
#Calculates the actual cost-based distances.
dists.test = costDistance(transition.mat1, test.pts, ssw.pts)
proc.time() - ptm #About 55 seconds on my laptop.
And then for comparison, here's the sfnetworks way as provided by mrhellmann.
ptm <- proc.time()
sq_grid_sample = st_sample(st_as_sfc(st_bbox(Moose.lake)), size = 1000, type = 'regular') %>% st_as_sf() %>%
st_connect(.,.,k = 9)
sq_grid_cropped = sq_grid_sample[st_within(sq_grid_sample, Moose.lake, sparse = F)] #I needed to delete a comma in this line to get it to work--I don't think we needed row,column indexing here.
lake_network = sq_grid_cropped %>% as_sfnetwork()
##Using their code to generate all the distances between the dock and the sample points. This might be vectorizable.
dists.test2 = rep(0, nrow(Moose.ssw))
for (i in 1:nrow(Moose.ssw)) {
dist.pt = st_network_paths(lake_network,
from = Moose.access,
to = Moose.ssw[i,]) %>%
pull(edge_paths) %>%
unlist() #This produces the indices of the edges the path passes through, I think.
dists.test2[i] = lake_network %>%
activate(edges) %>%
slice(dist.pt) %>%
st_as_sf() %>%
st_combine() %>%
st_length()
}
proc.time() - ptm #About 58 seconds on my laptop.
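As mrhellmann's edit above shows, the for() loop can also be replaced with a single st_network_cost() call; a minimal sketch of that vectorized version (it returns a units matrix with one column per sample point):
dists.test2.vec = st_network_cost(lake_network,
                                  from = Moose.access,
                                  to = Moose.ssw)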
Here are two graphs that I think demonstrate that both approaches produce relatively similar numbers without obvious signs of bias, although it does look like some of the distances ended up a little longer with the sfnetworks approach. The first is an overlapped density plot of the distance estimates from the two approaches, and the second is a "1-to-1" scatterplot of the distances plotted against each other; you can see that the fit to a 1-to-1 line would be pretty good.
hist.df = data.frame(dists = c(dists.test, dists.test2), version = rep(c("sp", "sfnetworks"), each = 322))
gf_density(~dists, fill=~version, data=hist.df )
gf_point(dists.test2~dists.test)
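To draw the 1-to-1 reference line on that scatterplot, a ggplot2 equivalent (a sketch; ggplot2 is already attached via tidyverse) would be:
ggplot(data.frame(sp = as.numeric(dists.test), sfnetworks = dists.test2),
       aes(x = sp, y = sfnetworks)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red")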

I know this question has been answered, but I thought I'd throw an alternative method in here anyway. And, it looks like (for me at least) this approach is quite a bit faster than the others. This uses terra, which is the "replacement" for the raster package; many functions have the same names and do the same jobs, but the main difference is that terra uses SpatRaster objects, whereas raster uses Raster* objects.
# load libraries & import data ---
library(tidyverse)
library(sf)
library(terra)
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
The first thing we do is make the lake into a raster, which is called a SpatRaster by terra. We use the extent (terra::ext()) of Moose.lake, 1000 rows and 1000 cols, and set the crs to the same as Moose.lake. Initially, we give every cell a value of 1, but then we can use terra::mask() to give every value outside Moose.lake a value of 2 (these will be the "high cost" or "no go zone" cells).
# step 1 - rasterize() the lake ---
# find the spatial extent of the lake
ext(Moose.lake) %>%
# make a raster with the same extent, and assign all cells to value of 1
rast(nrow = 1000, ncol = 1000, crs = st_crs(Moose.lake)[1], vals = 1) %>%
# set all cells outside the lake a value of 2
mask(vect(Moose.lake), updatevalue = 2) %>%
{. ->> moose_rast}
moose_rast
# class : SpatRaster
# dimensions : 1000, 1000, 1 (nrow, ncol, nlyr)
# resolution : 2.899889, 3.239947 (x, y)
# extent : 388474, 391373.9, 5265182, 5268422 (xmin, xmax, ymin, ymax)
# coord. ref. : +proj=utm +zone=15 +datum=NAD83 +units=m +no_defs
# source : memory
# name : lyr.1
# min value : 1
# max value : 2
plot(moose_rast)
Then, we give the cell where Moose.access is a value of 3 (or any value other than 1 or 2) - this will be our starting point.
# now, make the boat ramp a value of 3 (this is the starting point)
# find the cell number based on the coordinates of Moose.access
moose_rast %>%
cellFromXY(st_coordinates(Moose.access)) %>%
{. ->> access_cell}
access_cell
# [1] 561618
# assign that cell a value of 3 (or any number you want other than 1 or 2)
values(moose_rast)[access_cell] <- 3
# check it
moose_rast %>% plot
Although you can't really see it on the plot because the cell is so small, we can tell that the cell value has changed because our legend now includes a value of 3.
Next up, we use terra::gridDistance() to find the distance from the starting cell (value of 3), to every other cell. We specify origin = 3 to assign this cell as the starting point, and we tell it not to traverse any cells with value of 2 using omit = 2. This function returns a SpatRaster, but this time the cell values are the distances to the origin.
# now, find distances from the access_cell to every other cell
moose_rast %>%
gridDistance(origin = 3, omit = 2) %>%
{. ->> moose_rast_distances}
moose_rast_distances
# class : SpatRaster
# dimensions : 1000, 1000, 1 (nrow, ncol, nlyr)
# resolution : 2.899889, 3.239947 (x, y)
# extent : 388474, 391373.9, 5265182, 5268422 (xmin, xmax, ymin, ymax)
# coord. ref. : +proj=utm +zone=15 +datum=NAD83 +units=m +no_defs
# source : memory
# name : lyr.1
# min value : 0
# max value : 2424.096
# check it
moose_rast_distances %>% plot
In this plot, the areas closest to the access cell are white, and those farthest away are green.
So now all we have to do is find the cell numbers of the sample sites within the lake and pull out their cell values. We use terra::cellFromXY() to find the cell numbers, based on a set of XY coordinates.
# now, find the cell numbers of all the study sites
moose_rast %>%
cellFromXY(st_coordinates(Moose.ssw)) %>%
{. ->> site_cells}
# cell numbers
site_cells %>%
head(50)
# [1] 953207 953233 930156 930182 930207 930233 907156 907182 907207 884078 884130
# [12] 884156 884182 884207 861052 861078 861104 861130 861156 861182 861207 837026
# [23] 837052 837078 837104 837130 837156 837182 837207 814026 814052 814078 814104
# [34] 814130 814156 814182 814207 791026 791104 791130 791156 791182 768182 745182
# [45] 745207 722026 722233 699259 675052 675285
# and now their distance values
values(moose_rast_distances)[site_cells] %>% head(50)
# [1] 2212.998 2241.812 2144.020 2115.206 2138.479 2167.293 2069.501 2040.688 2063.960
# [10] 2081.424 2023.796 1994.983 1966.169 1989.441 2078.719 2006.905 1978.092 1949.278
# [19] 1920.464 1891.650 1914.923 2119.358 2043.960 1968.563 1900.333 1871.519 1842.705
# [28] 1813.891 1837.164 2086.047 2010.650 1935.253 1859.856 1797.000 1768.186 1739.372
# [37] 1762.645 2052.736 1826.545 1751.148 1693.667 1664.854 1590.335 1533.733 1488.110
# [46] 1952.805 1384.778 1281.445 1809.338 1174.872
And to make the distances a little more user friendly, we can instead put them into a new column in the sf collection.
# now, make a new column in the study sites which is the distance to the access_cell
Moose.ssw %>%
rowwise %>%
mutate(
distance_to_access = cellFromXY(moose_rast, st_coordinates(geometry)) %>% values(moose_rast_distances)[.]
) %>%
select(distance_to_access, everything())
# Simple feature collection with 322 features and 122 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 388549 ymin: 5265332 xmax: 391324 ymax: 5268332
# CRS: +proj=utm +zone=15 +datum=NAD83 +unit=m
# # A tibble: 322 x 123
# # Rowwise:
# distance_to_access lake county dow lake_size_acres contact year_discovered first_year_trea~ survey_year
# <dbl> <chr> <chr> <int> <dbl> <chr> <int> <chr> <int>
# 1 2213. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 2 2242. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 3 2144. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 4 2115. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 5 2138. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 6 2167. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 7 2070. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 8 2041. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 9 2064. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 10 2081. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# # ... with 312 more rows, and 114 more variables: survey_dates <chr>, surveyor <chr>, pointid <int>, depth_ft <dbl>,
# # Myriophyllum_spicatum_or_hybrid_n <int>, Potamogeton_crispus_n <int>, Bidens_beckii_n <int>,
# # Brasenia_schreberi_n <int>, Ceratophyllum_demersum_n <int>, Ceratophyllum_echinatum_n <int>, Chara_sp_n <int>,
# # Eleocharis_acicularis_n <int>, Eleocharis_palustris_n <int>, Elodea_canadensis_n <int>,
# # Elodea_nuttallii_n <int>, Heteranthera_dubia_n <int>, Lemna_minor_n <int>, Lemna_trisulca_n <int>,
# # Myriophyllum_heterophyllum_n <int>, Myriophyllum_sibiricum_n <int>, Myriophyllum_verticillatum_n <int>,
# # Najas_flexilis_n <int>, Najas_gracillima_n <int>, Najas_guadalupensis_n <int>, Najas_marina_n <int>, ...
The distances are in metres because the CRS is UTM.
Additionally, gridDistance() could also be used to find the shortest distance from every point within the lake to the lake's edge. To do this we set origin = 2, which, rather than a single point like our boat ramp earlier, is every cell on land.
moose_rast %>%
gridDistance(origin = 2) %>%
plot
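For example (a sketch reusing site_cells from above), each sample site's distance to shore can be pulled the same way the launch distances were:
moose_rast %>%
  gridDistance(origin = 2) %>%
  {. ->> moose_rast_shore}
values(moose_rast_shore)[site_cells] %>% head()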
So I ran the two approaches in your answer above, and compared the results and times to this method.
Time-wise, the terra approach was much faster, on my computer anyway:
sp ~230 secs
sfnetworks ~83 secs
terra ~5 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# sp
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
ptm <- proc.time()
#Make all the objects needed into spatial class objects.
test.pts = as(Moose.access, Class = "Spatial")
ssw.pts = as(Moose.ssw, Class = "Spatial")
test.bounds = as(Moose.lake, Class = "Spatial")
#Turn the lake into a raster of resolution 1000 x 1000 (which is arbitrary) and then make all points not in the lake = 0 so that no distances can "afford" to cross over land.
test.raster = raster(extent(test.bounds), nrow=1000, ncol=1000)
lake.raster = rasterize(test.bounds, test.raster)
lake.raster[is.na(lake.raster)] = 0
##For some lakes, although not this one, the public access was just off the lake, which resulted in distances of Inf. This code puts a circular buffer around the dock locations to prevent this. It makes a new raster with 1s at the dock location(s), finds all indices of cells within the buffer distance of each dock (here, 300, which is overkill), and makes the corresponding cells in the lake raster also 1 so that they are considered navigable. This makes the distances slightly inaccurate because it may allow some points to be more easily reachable than they should be, but it is better than the alternative!
access.raster = rasterize(test.pts, lake.raster, field = 1)
index.spots = raster::extract(access.raster, test.pts, buffer=300, cellnumbers = T)
index.spots2 = unlist(lapply(index.spots, "[", , 1))
lake.raster[index.spots2] = 1
#Make a transition matrix so that the cost of moving from one cell to the next can be understood. TBH, I don't understand this part well beyond that.
transition.mat1 = transition(lake.raster, transitionFunction=mean, directions=16) #This code does take a little while.
transition.mat1 = geoCorrection(transition.mat1,type="c")
#Calculates the actual cost-based distances.
dists.test = costDistance(transition.mat1, test.pts, ssw.pts)
proc.time() - ptm
# 234 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# sfnetworks
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
ptm <- proc.time()
sq_grid_sample = st_sample(st_as_sfc(st_bbox(Moose.lake)), size = 1000, type = 'regular') %>% st_as_sf() %>%
st_connect(.,.,k = 9)
sq_grid_cropped = sq_grid_sample[st_within(sq_grid_sample, Moose.lake, sparse = F)] #I needed to delete a comma in this line to get it to work--I don't think we needed row,column indexing here.
lake_network = sq_grid_cropped %>% as_sfnetwork()
##Using their code to generate all the distances between the dock and the sample points. This might be vectorizable.
dists.test2 = rep(0, nrow(Moose.ssw))
for (i in 1:nrow(Moose.ssw)) {
dist.pt = st_network_paths(lake_network,
from = Moose.access,
to = Moose.ssw[i,]) %>%
pull(edge_paths) %>%
unlist() #This produces the indices of the edges the path passes through, I think.
dists.test2[i] = lake_network %>%
activate(edges) %>%
slice(dist.pt) %>%
st_as_sf() %>%
st_combine() %>%
st_length()
}
proc.time() - ptm
# 83 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# terra
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(tidyverse)
library(sf)
library(terra)
ptm <- proc.time()
# make raster of Moose Lake
ext(Moose.lake) %>%
# make a raster with the same extent, and assign all cells to value of 1
rast(nrow = 1000, ncol = 1000, crs = st_crs(Moose.lake)[1], vals = 1) %>%
# set all cells outside the lake a value of 2
mask(vect(Moose.lake), updatevalue = 2) %>%
{. ->> moose_rast}
# find access cell
moose_rast %>%
cellFromXY(st_coordinates(Moose.access)) %>%
{. ->> access_cell}
# assign that cell a value of 3 (or any number you want other than 1 or 2)
values(moose_rast)[access_cell] <- 3
# find values to every other cell
moose_rast %>%
gridDistance(origin = 3, omit = 2) %>%
{. ->> moose_rast_distances}
# make column with distances to each sample site
distances <- Moose.ssw %>%
rowwise %>%
mutate(
distance_to_access = cellFromXY(moose_rast, st_coordinates(geometry)) %>% values(moose_rast_distances)[.]
) %>%
select(distance_to_access, everything())
proc.time() - ptm
# 5 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hist.df2 = data.frame(dists = c(dists.test, dists.test2, distances$distance_to_access), version = rep(c("sp", "sfnetworks", "terra"), each = 322))
gf_density(~dists, fill=~version, data=hist.df2)
And the distance values from the terra method are pretty similar to those from the sp method.
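A 1-to-1 scatter makes the comparison explicit (a sketch, assuming dists.test from the sp run above is still in the workspace):
plot(as.numeric(dists.test), distances$distance_to_access,
     xlab = "sp distance (m)", ylab = "terra distance (m)")
abline(0, 1, col = "red")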

Related

How to plot coordinates on a spherical shape file correctly?

So I'm trying to plot a bunch of coordinates on the earth and track how many coordinates are in each country. I have plotted the map and coordinates fine, but when I try to use the intersection to count how many coordinates fall within each country (polygon), it results in an error. I've tried using the st_make_valid() function to fix the earth shapefile, but it messes up the geometry. I'm new to using R, so any help would be greatly appreciated.
I have used the following code to plot the earth shape file and the coordinates on top:
library(tidyverse)
library(sf)
library(rmapshaper)
library(rnaturalearth)
library(rnaturalearthdata)
library(sp)
library(raster)
###############
# Load Data
###############
# Read in data from .csv file
MeteoriteData <- read.csv("C:/Users/ChaseDickson_/Desktop/College/AERO 689/Semester Project/Meteorite Landings.csv")
# Convert these points to an SF object, specifying the X and Y
# column names, and supplying the CRS as 4326 (which is WGS84)
MeteoriteData.sf <- st_as_sf(MeteoriteData, coords=c('long', 'lat'), crs=4326)
world <- (ne_countries(scale = "medium", returnclass = "sf"))
MeteoriteMap <- ggplot(data = world) +
geom_sf() +
geom_sf(data = MeteoriteData.sf, size = 0.5, shape = 23, fill = "darkred") +
theme_bw()
MeteoriteMap
Which gives the following plot
However, when getting the intersection of the code I used this
intersection <- st_intersection(x = world, y = MeteoriteData.sf)
But it gave the error
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
Loop 96 is not valid: Edge 743 crosses edge 998
To fix this I changed the world sf by adding st_make_valid like so:
world <- st_make_valid(ne_countries(scale = "small", returnclass = "sf"))
Now this allows the intersection function to work as such:
intersection <- st_intersection(x = world, y = MeteoriteData.sf)
int_result <- intersection %>%
group_by(sovereignt) %>%
count()
And the output is shown below.
However, this messes up the countries (polygons) in the plot and will give inaccurate data, as the new earth shapefile shown below demonstrates:
Any help figuring out how to maintain the first plot, but still get the intersection function and count to work after adding st_make_valid would be greatly appreciated!
The {rnaturalearth} package has had a long and productive history, but - kind of like the similar {maps} package - it belongs to a different, less demanding era. You should consider doing a Marie Kondo on it: thank it for its service, and let it go.
So instead of trying to repair its failings, look for a different instance of a world dataset; this is a very common and thus standardized use case.
Consider this piece of code, and note that it is not a single piece of wrong geometry, but 6 (out of 241). Correcting them one by one would be a fruitless task.
library(sf)
rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")|>
st_is_valid() |>
table()
# FALSE TRUE
# 6 235
My preferred source of world country data is the {giscoR} package, which interfaces the GISCO spatial datasets, ultimately maintained by Eurostat.
It is very handy, known to be valid, and actively maintained.
giscoR::gisco_get_countries(resolution = "20") |>
st_is_valid() |>
table()
# TRUE
# 257
The rest of your code - the intersection and plotting part - should work just fine once you get rid of invalid geometries.
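For instance, the counting workflow from the question might look like this with the GISCO data (a sketch; note that the country-name column is NAME_ENGL rather than rnaturalearth's sovereignt):
world <- giscoR::gisco_get_countries(resolution = "20")
intersection <- st_intersection(x = world, y = MeteoriteData.sf)
int_result <- intersection |>
  dplyr::group_by(NAME_ENGL) |>
  dplyr::count()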
Just a note regarding st_intersection() (and shamelessly building on top of Jindra's answer): it happens to be an extremely expensive method for retrieving a hit count, especially if the resulting geometries are just a by-product that will be disregarded (and it's expensive for that too; check st_join()).
For counting, you can save a lot (like ~250ms vs ~16s(!) for 100 random points on my machine) by opting for st_intersects():
library(sf)
library(ggplot2)
countries <- giscoR::gisco_get_countries(resolution = "20")
set.seed(1)
random_points <- data.frame(x = runif(100,-180,180), y = runif(100,-90,90)) |>
st_as_sf(coords = c("x","y"), crs = "WGS84")
#> Measure st_intersection():
system.time({
countries_isect <- st_intersection(countries, random_points)
})
#> user system elapsed
#> 15.97 0.47 16.50
#> Measure st_intersects():
system.time({
countries$hits <- lengths(st_intersects(countries, random_points))
})
#> user system elapsed
#> 0.22 0.00 0.22
ggplot(countries) +
geom_sf(data = countries) +
geom_sf(data = random_points) +
theme_void()
Results, i.e. non-zero lengths(st_intersects(countries, random_points)):
countries[countries$hits > 0,c("NAME_ENGL","hits")] |> st_drop_geometry()
#> NAME_ENGL hits
#> 4 Antarctica 7
#> 13 Australia 1
#> 30 Brazil 4
#> 31 China 3
#> 48 Greenland 1
#> 55 Canada 3
#> 116 Kazakhstan 1
#> 117 Laos 1
#> 125 Cambodia 1
#> 141 Mauritania 1
#> 155 Oman 1
#> 162 Paraguay 1
#> 184 Mali 1
#> 185 Russian Federation 5
#> 221 United States 3
#> 223 Venezuela 1
#> 249 Thailand 1
Created on 2022-11-23 with reprex v2.0.2

Geosphere package distance from point to polygon

I'm trying to use the geosphere package in R to get the distance to a polygon from a set of points that lie outside of that polygon.
The polygon is a shapefile of the Antarctic coastline, found here: https://data.bas.ac.uk/items/e6cf8946-e493-4c36-b4f5-58f7a2ee2a74/ and the points are animal tracking data.
I have tried using the syntax specified in the geosphere documentation (https://www.rdocumentation.org/packages/geosphere/versions/1.5-14/topics/dist2Line)
which is as follows:
dist2Line(p, line, distfun=distGeo)
#my attempt so far:
#libraries
library(rgdal)
library(sf)
library(rgeos)
library(tidyverse)
library(geosphere)
#my points
points <-read.csv("Analyses/example_points.csv") #this is the table included below of 4 example locations.
|ID|LON |LAT |
|--|----------|----------|
|a |-2.515478 |-69.53887 |
|b |-2.601405 |-69.79783 |
|c |-0.153548 |-69.45126 |
|d |26.06987 |-69.55020 |
#my line
line <- readOGR('Environmental_Data/COAST/add_coastline_high_res_polygon_v7_5.shp/') #this is the shapefile linked above
#convert points to spatial object
coordinates(points) <- ~LON+LAT
distance <- geosphere::dist2Line(p = points, line = line, distfun = distGeo)
However, I get an error: "Error in .spDistPoint2Line(p, line, distfun) :
Points are projected. They should be in degrees (longitude/latitude)".
The package documentation states that p can be:
"longitude/latitude of point(s). Can be a vector of two numbers, a matrix of 2 columns (first one is longitude, second is latitude) or a SpatialPoints object*" - which is what I'm providing it with. I have seen the same issue encountered on a Reddit post (unanswered) but not on here.
My desired output is as below (distances under Dist_to_coast are made up for now!). I have ~3000 locations I need to find the distance to the coastline for.
|ID|LON |LAT |Dist_to_coast (km)|
|--|----------|----------|------------------|
|a |-2.515478 |-69.53887 |40 |
|b |-2.601405 |-69.79783 |24 |
|c |-0.153548 |-69.45126 |74 |
|d |26.06987 |-69.55020 |23 |
Is there an alternative/better means of doing this?
Thank you.
You have loaded sf; any particular reason for not using sf::st_distance() for the task? You'd still need to transform, though. Since there are only 4 sample points vs a ~140MB shapefile with ~17,000 polygons, the points were transformed to match the shapefile's CRS:
library(ggplot2)
library(dplyr)
library(sf)
coastline <- st_read("add_coastline_high_res_polygon_v7_6.shp/")
p <- readr::read_delim(
"ID|LON |LAT
a |-2.515478 |-69.53887
b |-2.601405 |-69.79783
c |-0.153548 |-69.45126
d |26.06987 |-69.55020" , delim = "|", trim_ws = T) %>%
st_as_sf(coords = c("LON", "LAT"), crs = "WGS84") %>%
# transform points to match crs of the shapefile
st_transform(st_crs(coastline))
# number of different surface polygons
table(coastline$surface)
#>
#> ice shelf ice tongue land rumple
#> 325 37 17233 64
# create a single multipolygon, can take a while;
# you may need to filter first to define any surface types you might want to
# include / exclude ("land" also includes islands)
system.time({
ucoastline <- st_union(coastline)
})
#> user system elapsed
#> 103.40 11.72 116.08
p$dist_ucoastline <- st_distance(p,ucoastline)
# or perhaps select land polygon with max area to
# ignore ice and all the islands:
land_max <- coastline %>%
slice_max(st_area(.))
p$land_max <- st_distance(p,land_max)
ggplot() +
geom_sf(data = st_simplify(ucoastline,dTolerance = 1000), fill = "lightblue", color = NA) +
geom_sf(data = st_simplify(land_max,dTolerance = 1000), fill = "gray70") +
geom_sf(data = p, shape =4, color="red", size = 5) +
theme_bw()
Result:
# convert coordinates back to WGS84,
# geometries to coordinate columns
bind_cols(
st_transform(p, crs = "WGS84") %>% st_coordinates(),
st_drop_geometry(p)
)
#> # A tibble: 4 × 5
#> X Y ID dist_ucoastline[,1] land_max[,1]
#> <dbl> <dbl> <chr> [m] [m]
#> 1 -2.52 -69.5 a 40742. 180479.
#> 2 -2.60 -69.8 b 39750. 157043.
#> 3 -0.154 -69.5 c 6629. 186878.
#> 4 26.1 -69.6 d 45683. 121500.
Created on 2022-11-23 with reprex v2.0.2
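And since the desired output table is in kilometres, the distance column can be converted directly (a sketch using the units package, which sf already builds on):
library(units)
p$Dist_to_coast_km <- set_units(p$dist_ucoastline, km)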

Average points together without repeating and reduce final dataframe

The goal is to average points together within 10 meters without repeating any points in the averaging, reduce the point dataframe to the averaged points, and ideally obtain a smooth flow of points along the routes where said points were collected. Here is an 11-point subset example dataframe from a much larger file (25,000 observations):
library(sf)
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
Here is what I've tried:
Many iterations of st_is_within_distance(trav, trav, tolerance) including:
an aggregate method shown here. These don't work because the same points get averaged multiple times.
Got close with filter and across by trying to dynamically update a list in lapply, but it didn't work out in the end.
This is helpful from @jeffreyevans, but doesn't really solve the problem and is a bit outdated.
The spThin package doesn't work because it's made for more specific variables.
I thought to cluster using this post, but the clusters throw random points and don't actually reduce the dataframe efficiently.
Here is as close as I've gotten. Again, the issue with this solution is it repeats points in collecting averages, which gives more weight to certain points than others.
# first set tolerance
tolerance <- 20 # 20 meters
# get distance between points
i <- st_is_within_distance(df, df, tolerance)
# filter for indices with more than 1 (self) neighbor
i <- i[which(lengths(i) > 1)]
# filter for unique indices (point 1, 2 / point 2, 1)
i <- i[!duplicated(i)]
# points in `sf` object that have no neighbors within tolerance
no_neighbors <- df[!(1:nrow(df) %in% unlist(i)), ]
# iterate over indices of neighboring points
avg_points <- lapply(i, function(b){
df <- df[unlist(b), ]
coords <- st_coordinates(df)
df <- df %>%
st_drop_geometry() %>%
cbind(., coords)
df_sum <- df %>%
summarise(
datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y),
.groups = 'drop') %>%
ungroup()
return(df_sum)
}) %>%
bind_rows() %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
Another answer, using sf::aggregate() and a hexagonal grid to find points that are within a particular distance from each other. A square grid could be used as well. Results will vary some depending on where exactly the grid falls in relation to the points, but no point should be used more than once in determining the mean.
Outline of the steps:
load data, transform to crs 5070 for measurements in meters
get a bounding box of the data
make a grid of hexagons of the bounding box of ~10m diameter each
aggregate points falling in the same hexagon using mean
join to original data
library(sf)
library(tidyverse)
set.seed(22) # might be needed to get same hex grid?
#### your sample data
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs) %>%
st_transform(5070) ### transform to 5070 for a projection in meters
#### end sample data
# Get a bounding box as an sf object to make a grid
bbox <- st_bbox(df) %>% st_as_sfc()
# Make a grid as hexagons with approximately the right size
# area ~86 m^2; side ~5.75 m; long diagonal ~11.5 m
hex_grid <- st_make_grid(bbox, cellsize = 10, square = F) %>% st_as_sf()
# Aggregate mean of the hexagonal grid
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait))
# Assign the mean of the hexagon to points that fall
# within each hexagon
df_agg <- st_join(df, hex_agg)
head(df_agg) # trait.x from df, trait.y from the mean by hexagon
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 121281.6 ymin: 1786179 xmax: 121285.3 ymax: 1786227
#> Projected CRS: NAD83 / Conus Albers
#> trait.x datetime.x trait.y datetime.y
#> 1 91.22 2021-08-06 15:08:43 91.70500 2021-08-06 15:26:00
#> 2 91.22 2021-08-06 15:08:44 91.32667 2021-08-06 15:31:47
#> 3 91.22 2021-08-06 15:08:46 91.22000 2021-08-06 15:08:46
#> 4 91.58 2021-08-06 15:08:47 91.58000 2021-08-06 15:08:47
#> 5 91.47 2021-08-06 15:43:17 91.47000 2021-08-06 15:43:17
#> 6 92.19 2021-08-06 15:43:18 91.70500 2021-08-06 15:26:00
#> geometry
#> 1 POINT (121282.5 1786184)
#> 2 POINT (121283.2 1786194)
#> 3 POINT (121284.6 1786216)
#> 4 POINT (121285.3 1786227)
#> 5 POINT (121281.7 1786179)
#> 6 POINT (121281.6 1786186)
sum(df_agg$trait.x) - sum(df_agg$trait.y) # original trait - aggregate trait should be 0, or near 0
#> [1] 0
ggplot(df_agg) +
geom_sf(aes(size = trait.x), alpha = .2, color = 'blue') + # Original triat
geom_sf(aes(size = trait.y), alpha = .2, color = 'red') + # New aggregated trait
theme_void()
Sized by trait. Blue points are original, red is the new spatial mean.
## Plot of
# original points & hex grid used:
ggplot() +
geom_sf(data = df, color = 'red') +
geom_sf(data = hex_grid, fill = NA) +
theme_void()
Plot showing the grouping of the points for the mean. Looks like there were groups of 1,2, and 3 points per hexagon for the mean.
Created on 2022-03-23 by the reprex package (v2.0.1)
Edit
Updated to have only one point per hexagon, losing some of the original points
## Edit for one point per hexagon:
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait)) %>%
rownames_to_column('hex_num') # add hexagon number to group_by
## Guide to join on, has only hexagon number & centroid of contained points
hex_guide <- df_agg %>%
group_by(hex_num) %>%
summarise() %>%
st_centroid()
# The full sf object with only one point per hexagon
# this join isn't the most efficient, but slice(1) removes
# the duplicate data. You could clean df_agg before the join
# to resolve this
final_join <- df_agg %>%
st_drop_geometry() %>%
left_join(hex_guide, by = 'hex_num') %>%
group_by(hex_num) %>%
slice(1) %>%
ungroup() %>%
st_as_sf()
ggplot() +
geom_sf(data = final_join, color = 'red', size = 3) +
geom_sf(data = df, color = 'black', alpha = .5) +
geom_sf(data = hex_grid, color = 'blue', fill = NA)
The plot shows the hexagons, original data points in grey, and new red points at the centroid of grouped original points. Only 1 red point per hexagon.
I'm not sure, but perhaps this is what you are looking for?
You can experiment with the different settings/methods of smoothr::smooth() to get the desired results.
library(tidyverse)
library(igraph)
library(smoothr)
library(mapview) # for viewing purposes only
# get a matrix of points <10 meter apart
m <- st_is_within_distance(df, dist = 10, sparse = FALSE)
# creata an igraph from the matrix
g <- graph.adjacency(m, mode="undirected", diag = FALSE)
plot(g)
points that are within 10 metres of each other?
# pass cluster-number to df object
df$id <- as.vector(components(g)$membership)
# create polylines (only if more than 1 point!)
df.lines <- df %>%
group_by(id) %>%
dplyr::add_tally() %>%
dplyr::filter(n > 1) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("LINESTRING") %>%
# create smooth lines
smoothr::smooth(method = "ksmooth")
#view points and lines
mapview::mapview(list(df, df.lines))
If I understand your problem correctly, it all boils down to selecting the "right" neighbors, i.e. those within a certain neighborhood which were not used yet. If there is no such neighbor, simply use the point itself (even if it was already used in the averaging for another point).
Here's a solution using purrr::accumulate to first produce the correct indices and then simply use these indices to do the averaging:
library(purrr)
library(dplyr)
idx <- accumulate(i[-1L], function(x, y) {
x$points <- setdiff(y, x$used)
x$used <- union(x$used, y)
x
}, .init = list(used = i[[1L]], points = i[[1L]]))
idx[1:4]
# [[1]]
# [[1]]$used
# [1] 1 2 5 6 7 8 9
#
# [[1]]$points
# [1] 1 2 5 6 7 8 9
#
#
# [[2]]
# [[2]]$used
# [1] 1 2 5 6 7 8 9 10 11
#
# [[2]]$points
# [1] 10 11
#
#
# [[3]]
# [[3]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[3]]$points
# [1] 3 4
#
#
# [[4]]
# [[4]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[4]]$points
# integer(0)
The idea is that we maintain a list of used indices, i.e. the ones which were already used in any of the neighborhoods, and of the remaining points (points). For instance, for the first point we use the points at indices 1, 2, 5, 6, 7, 8, 9, which leaves only indices 10, 11 for the second point. If there is no point left, we return integer(0).
Now that we have set up the indices list, the rest is easy: we loop through the list, select the indicated points (using the point itself in case there is no point left) and do the averaging:
idx %>%
imap_dfr(function(x, y) {
if (!length(x$points)) {
idx <- y
} else {
idx <- x$points
}
df[idx, , drop = FALSE] %>%
bind_cols(st_coordinates(.) %>% as_tibble()) %>%
st_drop_geometry() %>%
summarise(datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y))
}) %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
# Simple feature collection with 11 features and 2 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
# CRS: +proj=longlat +datum=WGS84 +no_defs
# First 10 features:
# datetime trait geometry
# 1 2021-08-06 15:08:43 91.34714 POINT (-94.58464 39.09259)
# 2 2021-08-06 15:43:22 91.65000 POINT (-94.58473 39.09274)
# 3 2021-08-06 15:08:46 91.40000 POINT (-94.5846 39.09286)
# 4 2021-08-06 15:08:47 91.58000 POINT (-94.58459 39.09291)
# 5 2021-08-06 15:43:17 91.47000 POINT (-94.58464 39.09248)
# 6 2021-08-06 15:43:18 92.19000 POINT (-94.58464 39.09255)
# 7 2021-08-06 15:43:19 92.19000 POINT (-94.58464 39.09261)
# 8 2021-08-06 15:43:20 90.57000 POINT (-94.58464 39.09266)
# 9 2021-08-06 15:43:21 90.57000 POINT (-94.58466 39.0927)
# 10 2021-08-06 15:43:22 91.65000 POINT (-94.5847 39.09273)
If the goal is to not weight any point more than any other point in the cluster averages, it would be more balanced to use weighted averages rather than trying to force each cluster to contain a set of points unique from all other clusters.
One way to think of the below methodology is to "chop up" each observation and divvy up the pieces into clusters in such a way that the weight of the pieces in each cluster sums to 1.
This will probably be too expensive for 25k observations, so one option could be to perform this on overlapping or non-overlapping segments and stitch them together.
library(sf)
library(Rfast) # for the 'eachrow' function
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
n <- nrow(df)
# sum the trait column for a sanity check after calculations
sumtrait <- sum(df$trait)
# first set tolerance
tolerance <- 20 # 20 meters
tol <- 1e-5 # tolerance for the weight matrix marginal sums
# create clusters of points grouped by circles centered at each point
i <- st_is_within_distance(df, df, tolerance)
# Initialize a matrix for the weight of each point within each cluster. The
# initial value represents an unweighted average for each cluster, so the row
# sums are not necessarily 1.
sz <- lengths(i)
w <- replace(matrix(0, n, n), unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))), rep.int(1/sz, sz))
# iteratively adjust the weights until the marginal sums all equal 1 (within
# tolerance)
marg <- rowSums(w)
while (max(abs(marg - 1)) > tol) {
w <- w/marg
marg <- colSums(w)
w <- eachrow(w, marg, "/")
marg <- rowSums(w)
}
df$trait <- colSums(w*df$trait)
print(df, n = nrow(df))
#> Simple feature collection with 11 features and 2 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
#> CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#> trait datetime geometry
#> 1 91.37719 2021-08-06 15:08:43 POINT (-94.58463 39.09253)
#> 2 91.44430 2021-08-06 15:08:44 POINT (-94.58462 39.09262)
#> 3 91.31374 2021-08-06 15:08:46 POINT (-94.5846 39.09281)
#> 4 91.46755 2021-08-06 15:08:47 POINT (-94.58459 39.09291)
#> 5 91.64053 2021-08-06 15:43:17 POINT (-94.58464 39.09248)
#> 6 91.37719 2021-08-06 15:43:18 POINT (-94.58464 39.09255)
#> 7 91.44430 2021-08-06 15:43:19 POINT (-94.58464 39.09261)
#> 8 91.41380 2021-08-06 15:43:20 POINT (-94.58464 39.09266)
#> 9 91.41380 2021-08-06 15:43:21 POINT (-94.58466 39.0927)
#> 10 91.31880 2021-08-06 15:43:22 POINT (-94.5847 39.09273)
#> 11 91.31880 2021-08-06 15:43:23 POINT (-94.58476 39.09274)
# check that the sum of the "traits" column is unchanged
sum(df$trait) - sumtrait
#> [1] 4.875536e-07
UPDATE: If an exclusive grouping method is really needed, this implements a greedy algorithm:
avg_points <- numeric(nrow(df))
clusters <- vector("list", nrow(df))
currclust <- 0L
df$unused <- TRUE
for (cl in seq_len(nrow(df))) {
if (sum(df$unused[i[[cl]]])) {
currclust <- currclust + 1L
avg_points[currclust] <- mean(df$trait[i[[cl]]][df$unused[i[[cl]]]])
clusters[[currclust]] <- i[[cl]][df$unused[i[[cl]]]]
df$unused[i[[cl]]] <- FALSE
}
}
avg_points <- avg_points[1:currclust]
clusters <- clusters[1:currclust]
avg_points
#> [1] 91.34714 91.65000 91.40000
clusters
#> [[1]]
#> [1] 1 2 5 6 7 8 9
#>
#> [[2]]
#> [1] 10 11
#>
#> [[3]]
#> [1] 3 4
Note that the issue of uneven weightings is still present--the observations in group 1 are each weighted 1/7, while the observations in groups 2 and 3 are each weighted 1/2.

While loop inside a for loop to calculate geospatial distance between 2 datasets in R

I have a data.table with 957 geocodes. I want to match it with another dataset with 317 geocodes. The matching condition is geospatial proximity. I want to match each observation from the first dataset to an observation from the second one such that the distance between both observations is 5000 meters or less.
My data looks like this:
> muni[1:3]
mun Lat_Decimal Lon_Decimal
1: 1001 21.76672 -102.2818
2: 1002 22.16597 -102.0657
3: 1003 21.86138 -102.7248
> stations[1:3]
station_number station_lat station_long
1: 10003 25.100 -106.567
2: 10018 24.944 -106.259
3: 10031 24.523 -105.952
I am using the distm function from library(geosphere) to calculate the distance.
I figured the way to attack this problem is a while loop. The idea is to take the first observation from muni and measure the distance to the first observation in stations. If the distance is 5000 meters or less, then assign the station_number of that observation in stations to the first observation in muni. If the distance is greater than 5000, then try the next observation in stations until the distance is 5000 meters or less.
Essentially, it's a loop that finds the first observation in stations that's 5000 meters or closer to each observation in muni.
This is a preliminary attempt at it:
for (i in 1:957) {
j = 1
while (distm(muni[i, .(Lon_Decimal, Lat_Decimal)],
stations[j, .(station_long, station_lat)]) > 5000 & j <= 317) {
muni[i, station_number := as.integer(stations[j, station_number])]
muni[i, distance := distm(muni[i, .(Lon_Decimal, Lat_Decimal)],
stations[j, .(station_long, station_lat)])]
j = j + 1
}
}
I can tell this is not working because none of the rows in muni appear to have been overwritten after running this loop for (i in 1:3). I suppose there is an error in my loop that is ignoring the station_number := and distance := parts.
I would expect this loop to overwrite muni such that all the entire column had a station_number.
I've read your few sample points as data.frames and converted them to sf below for the answer. If you're attached to geosphere, forgive the pun, everything should still apply the same, given that geosphere::distm also returns a matrix of distances.
First we get your data into sf format:
library(sf)
stations_raw <- "station_number station_lat station_long
1: 10003 25.100 -106.567
2: 10018 24.944 -106.259
3: 10031 24.523 -105.952"
mun_raw <- "mun Lat_Decimal Lon_Decimal
1: 1001 21.76672 -102.2818
2: 1002 22.16597 -102.0657
3: 1003 21.86138 -102.7248"
mun_df <- read.table(text = mun_raw)
stations_df <- read.table(text = stations_raw)
mun_sf <- st_as_sf(mun_df, coords = c("Lon_Decimal", "Lat_Decimal"), crs = 4326)
stations_sf <- st_as_sf(stations_df,
coords = c("station_long", "station_lat"), crs = 4326)
Then, for each point in mun_sf, we find the closest station:
closest <- list()
for(i in seq_len(nrow(mun_sf))){
closest[[i]] <- stations_sf[which.min(
st_distance(stations_sf, mun_sf[i,])),]
}
Finally, we extract the identifiers and attach them to the original df, removing the mun_id as you request:
mun_sf$closest_station <- purrr::map_chr(closest, "station_number")
mun_sf <- mun_sf[, c("closest_station", "geometry")]
mun_sf
#> Simple feature collection with 3 features and 1 field
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: -102.7248 ymin: 21.76672 xmax: -102.0657 ymax: 22.16597
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> closest_station geometry
#> 1: 10031 POINT (-102.2818 21.76672)
#> 2: 10031 POINT (-102.0657 22.16597)
#> 3: 10031 POINT (-102.7248 21.86138)
The plot below helps visually check that, in this toy example, we've got the right answer.
ggplot() +
geom_sf(data = mun_sf, colour = "red") +
geom_sf_text(data = mun_sf, aes(label = mun), nudge_x = 0.25) +
geom_sf(data = stations_sf, colour = "blue") +
geom_sf_text(data = stations_sf, aes(label = station_number), nudge_x = -0.25)
#> Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
#> not give correct results for longitude/latitude data
#> Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
#> not give correct results for longitude/latitude data
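As an aside, with datasets of this size the explicit loop can be avoided entirely via sf::st_nearest_feature(); a minimal sketch, applying the 5000 m rule from the question afterwards:
idx <- st_nearest_feature(mun_sf, stations_sf)
d <- st_distance(mun_sf, stations_sf[idx, ], by_element = TRUE)
mun_sf$closest_station <- ifelse(as.numeric(d) <= 5000,
                                 stations_sf$station_number[idx], NA)
This assigns the nearest station only when it falls within 5000 m (NA otherwise), which matches the spirit of the matching condition in the question.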

R - Fitting a grid over a City Map and inputting data into grid squares

I'm trying to place a grid over San Jose like this:
Grid of San Jose
You can make the grid visually using the following code:
ca_cities = tigris::places(state = "CA") #using tigris package to get shape file of all CA cities
sj = ca_cities[ca_cities$NAME == "San Jose",] #specifying to San Jose
UTM_ZONE = "10" #the UTM zone for San Jose, will be used to convert the proj4string of sj into UTM
main_sj = sj@polygons[[1]]@Polygons[[5]] #the portion of the shape file I focus on. This is the boundary of San Jose
#converting the main_sj polygon into a spatialpolygondataframe using the sp package
tst_ps = sp::Polygons(list(main_sj), 1)
tst_sps = sp::SpatialPolygons(list(tst_ps))
proj4string(tst_sps) = proj4string(sj)
df = data.frame(f = 99.9)
tst_spdf = sp::SpatialPolygonsDataFrame(tst_sps, data = df)
#transforming the proj4string and declaring the finished map as "map"
map = sp::spTransform(tst_sps, CRS(paste0("+proj=utm +zone=", UTM_ZONE, " +ellps=WGS84")))
#designates the number of horizontal and vertical lines of the grid
NUM_LINES_VERT = 25
NUM_LINES_HORZ = 25
#getting bounding box of map
bbox = map@bbox
#Marking the x and y coordinates for each of the grid lines.
x_spots = seq(bbox[1,1], bbox[1,2], length.out = NUM_LINES_HORZ)
y_spots = seq(bbox[2,1], bbox[2,2], length.out = NUM_LINES_VERT)
#creating the coordinates for the lines. top and bottom connect to each other. left and right connect to each other
top_vert_line_coords = expand.grid(x = x_spots, y = y_spots[1])
bottom_vert_line_coords = expand.grid(x = x_spots, y = y_spots[length(y_spots)])
left_horz_line_coords = expand.grid(x = x_spots[1], y = y_spots)
right_horz_line_coords = expand.grid(x = x_spots[length(x_spots)], y = y_spots)
#creating vertical lines and adding them all to a list
vert_line_list = list()
for(n in 1 : nrow(top_vert_line_coords)){
vert_line_list[[n]] = sp::Line(rbind(top_vert_line_coords[n,], bottom_vert_line_coords[n,]))
}
vert_lines = sp::Lines(vert_line_list, ID = "vert") #creating Lines object of the vertical lines
#creating horizontal lines and adding them all to a list
horz_line_list = list()
for(n in 1 : nrow(top_vert_line_coords)){
horz_line_list[[n]] = sp::Line(rbind(left_horz_line_coords[n,], right_horz_line_coords[n,]))
}
horz_lines = sp::Lines(horz_line_list, ID = "horz") #creating Lines object of the horizontal lines
all_lines = sp::Lines(c(horz_line_list, vert_line_list), ID = 1) #combining horizontal and vertical lines into a single grid format
grid_lines = sp::SpatialLines(list(all_lines)) #converting the lines object into a Spatial Lines object
proj4string(grid_lines) = proj4string(map) #ensuring the projections are the same between the map and the grid lines.
trimmed_grid = intersect(grid_lines, map) #grid that shapes to the san jose map
plot(map) #plotting the map of San Jose
lines(trimmed_grid) #plotting the grid
However, I am struggling to turn each grid 'square' (some of the grid pieces are not squares since they fit to the shape of the san jose map) into a bin which I could input data into. Put another way, if each grid 'square' was numbered 1:n, then I could make a dataframe like this:
grid_id num_assaults num_thefts
1 1 100 89
2 2 55 456
3 3 12 1321
4 4 48 498
5 5 66 6
and fill each grid 'square' with data from the point location of each crime occurrence, hopefully using the over() function from the sp package.
I have tried solving this problem for weeks, and I can't figure it out. I have looked for an easy solution, but I can't seem to find it. Any help would be appreciated.
Additionally, here's an sf and tidyverse-based solution:
With sf, you can make a grid of squares with the st_make_grid() function. Here I'll make a 2km grid over San Jose's bounding box, then intersect it with the boundary of San Jose. Note that I'm projecting to UTM zone 10N so I can specify the grid size in meters.
library(tigris)
library(tidyverse)
library(sf)
options(tigris_class = "sf", tigris_use_cache = TRUE)
set.seed(1234)
sj <- places("CA", cb = TRUE) %>%
filter(NAME == "San Jose") %>%
st_transform(26910)
g <- sj %>%
st_make_grid(cellsize = 2000) %>%
st_intersection(sj) %>%
st_cast("MULTIPOLYGON") %>%
st_sf() %>%
mutate(id = row_number())
Next, we can generate some random crime data with st_sample() and plot it to see what we are working with.
thefts <- st_sample(sj, size = 500) %>%
st_sf()
assaults <- st_sample(sj, size = 200) %>%
st_sf()
plot(g$geometry)
plot(thefts, add = TRUE, col = "red")
Crime data can then be joined to the grid spatially with st_join(). We can plot to check our results.
theft_grid <- g %>%
st_join(thefts) %>%
group_by(id) %>%
summarize(num_thefts = n())
plot(theft_grid["num_thefts"])
We can then do the same with the assaults data, then join the two datasets together to get the desired result. If you had a lot of crime datasets, these could be modified to work within some variation of purrr::map() (one such variation is sketched after the output below).
assault_grid <- g %>%
st_join(assaults) %>%
group_by(id) %>%
summarize(num_assaults = n())
st_geometry(assault_grid) <- NULL
crime_data <- left_join(theft_grid, assault_grid, by = "id")
crime_data
Simple feature collection with 190 features and 3 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 584412 ymin: 4109499 xmax: 625213.2 ymax: 4147443
epsg (SRID): 26910
proj4string: +proj=utm +zone=10 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# A tibble: 190 x 4
id num_thefts num_assaults geometry
<int> <int> <int> <GEOMETRY [m]>
1 1 2 1 POLYGON ((607150.3 4111499, 608412 4111499, 608412 4109738,…
2 2 4 1 POLYGON ((608412 4109738, 608412 4111499, 609237.8 4111499,…
3 3 3 1 POLYGON ((608412 4113454, 608412 4111499, 607150.3 4111499,…
4 4 2 2 POLYGON ((609237.8 4111499, 608412 4111499, 608412 4113454,…
5 5 1 1 MULTIPOLYGON (((610412 4112522, 610412 4112804, 610597 4112…
6 6 1 1 POLYGON ((616205.4 4113499, 616412 4113499, 616412 4113309,…
7 7 1 1 MULTIPOLYGON (((617467.1 4113499, 618107.9 4113499, 617697.…
8 8 2 1 POLYGON ((605206.8 4115499, 606412 4115499, 606412 4114617,…
9 9 5 1 POLYGON ((606412 4114617, 606412 4115499, 608078.2 4115499,…
10 10 1 1 POLYGON ((609242.7 4115499, 610412 4115499, 610412 4113499,…
# ... with 180 more rows
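Here is one such purrr variation (a sketch; the layer names are illustrative, and any number of point layers works the same way):
# a named list of point layers to count within the grid
crime_layers <- list(thefts = thefts, assaults = assaults)
crime_counts <- purrr::imap(crime_layers, function(pts, nm) {
  g %>%
    st_join(pts) %>%
    st_drop_geometry() %>%
    group_by(id) %>%
    summarize(!!paste0("num_", nm) := n())
})
# join the per-layer counts back onto the grid by id
crime_data2 <- purrr::reduce(crime_counts, left_join, by = "id") %>%
  left_join(g, ., by = "id")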
With a Spatial* object, as in your data:
library(tigris)
ca_cities = tigris::places(state = "CA") #using tigris package to get shape file of all CA cities
sj = ca_cities[ca_cities$NAME == "San Jose",] #specifying to San Jose
sjutm = sp::spTransform(sj, CRS("+proj=utm +zone=10 +datum=WGS84"))
You can make a grid of polygons like this
library(raster)
r <- raster(sjutm, ncol=25, nrow=25)
rp <- as(r, 'SpatialPolygons')
Show it
plot(sjutm, col='red')
lines(rp, col='blue')
To count the number of cases per grid cell (using some random points here) you do not want to use the polygons but rather the RasterLayer
set.seed(0)
x <- runif(500, xmin(r), xmax(r))
y <- runif(500, ymin(r), ymax(r))
xy1 <- cbind(x, y)
x <- runif(500, xmin(r), xmax(r))
y <- runif(500, ymin(r), ymax(r))
xy2 <- cbind(x, y)
d1 <- rasterize(xy1, r, fun="count", background=0)
d2 <- rasterize(xy2, r, fun="count", background=0)
plot(d1)
plot(sjutm, add=TRUE)
Followed by
s <- stack(d1, d2)
names(s) = c("assault", "theft")
s <- mask(s, sjutm)
plot(s, addfun=function()lines(sjutm))
To get the table you are after
p <- rasterToPoints(s)
cell <- cellFromXY(s, p[,1:2])
res <- data.frame(grid_id=cell, p[,3:4])
head(res)
# grid_id assault theft
#1 1 1 1
#2 2 0 1
#3 3 0 3
#4 5 1 1
#5 6 1 0
#6 26 0 0
You can also create a SpatialPolygonsDataFrame from the results
pp <- as(s, 'SpatialPolygonsDataFrame')
pp
#class : SpatialPolygonsDataFrame
#features : 190
#extent : 584411.5, 623584.9, 4109499, 4147443 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=utm +zone=10 +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
#variables : 2
#names : assault, theft
#min values : 0, 0
#max values : 4, 5
If your goal is only the visual, and not necessarily all the grid-aggregation code and data, you can generate an interactive map and grid with library(mapdeck) (noting you'll need a Mapbox access token).
The first step to generate the data is borrowed from @kwalkertcu's answer:
library(tigris)
library(sf)
options(tigris_class = "sf", tigris_use_cache = TRUE)
set.seed(1234)
sj <- places("CA", cb = TRUE) %>%
filter(NAME == "San Jose") %>%
st_transform(26910)
thefts <- st_sample(sj, size = 500) %>%
st_sf() %>%
st_transform(crs = 4326)
## some random weight data
thefts$weight <- sample(1:100, size = nrow(thefts), replace = T)
Then, given an sf object with a weight column, you can plot it using add_screengrid():
library(mapdeck)
set_token("MAPBOX_TOKEN")
mapdeck(
style = mapdeck_style("dark")
, location = c(-121.8, 37.3)
, zoom = 6
) %>%
add_screengrid(
data = thefts
, cell_size = 15
, weight = "weight"
)
Notes:
I'm using the github version of mapdeck where the API has changed slightly, but the CRAN version should yield the same result.
