Geocoding: Efficient way to find the distance between two sets of locations in R

I have a set of coordinates for the locations of different individuals, and another set of coordinates for ballot drop-off boxes. I'm trying to find the distance between each residence and the nearest drop box. Below is the code I have so far, replicated from another Stack Overflow example. However, it is not very efficient: the dataset I'm working with is millions of rows, and the code builds all possible combinations of coordinates before pulling the least distance. Is there a more efficient way to deal with this?
What I currently have:
# Made-Up Data
library(geosphere)
library(tidyverse)
geo_voters <- data.frame(voter_id = c(12345, 45678, 89011),
                         long = c(-43.17536, -43.17411, -43.36605),
                         lat = c(-22.95414, -22.9302, -23.00133))
geo_dropoff_boxes <- data.frame(long = c(-43.19155, -43.33636, -67.45666),
                                lat = c(-22.90353, -22.87253, -26.78901))
# Code to find the distance between voters, and the dropoff boxes
# Order into a newdf as needed first.
# First, the voters:
voter_addresses <- data.frame(voter_id = as.character(geo_voters$voter_id),
                              lon_address = geo_voters$long,
                              lat_address = geo_voters$lat)
# Second, the polling locations:
polling_address <- data.frame(place_number = 1:nrow(geo_dropoff_boxes),
                              lon_place = geo_dropoff_boxes$long,
                              lat_place = geo_dropoff_boxes$lat)
# Create nested dfs:
voter_nest <- nest(voter_addresses, -voter_id, .key = 'voter_coords')
polling_nest <- nest(polling_address, -place_number, .key = 'polling_coords')
# Combine for combinations:
data_master <- crossing(voter_nest, polling_nest)
# Calculate shortest distance:
shortest_dist <- data_master %>%
  mutate(dist = map2_dbl(voter_coords, polling_coords, distm)) %>%
  group_by(voter_id) %>%
  filter(dist == min(dist)) %>%
  mutate(dist_km = dist / 1000,
         voter_id = as.character(voter_id)) %>%
  select(voter_id, dist_km)

The sf package makes this simple. The st_as_sf() function converts a data frame of lat-long values to georeferenced points, and the st_distance() function calculates the distances between them. When running st_as_sf(), you'll need to specify a coordinate reference system. It looks like you're using latitude and longitude, so I specify crs = "epsg:4326", which is the most common latitude/longitude reference.
library(sf)

geo_voters <- data.frame(voter_id = c(12345, 45678, 89011),
                         long = c(-43.17536, -43.17411, -43.36605),
                         lat = c(-22.95414, -22.9302, -23.00133))
geo_dropoff_boxes <- data.frame(long = c(-43.19155, -43.33636, -67.45666),
                                lat = c(-22.90353, -22.87253, -26.78901))

# convert the data to sf features
geo_voters = st_as_sf(geo_voters, coords = c('long', 'lat'), crs = "epsg:4326")
geo_dropoff_boxes = st_as_sf(geo_dropoff_boxes, coords = c('long', 'lat'), crs = "epsg:4326")
# calculate the distances between voters and drop boxes
dist = st_distance( geo_voters, geo_dropoff_boxes )
print(dist)
Now each row represents a voter and each column represents their distance to a drop box (in meters):
Units: [m]
          [,1]     [,2]    [,3]
[1,]  5866.745 18821.87 2482400
[2,]  3461.945 17813.57 2483210
[3,] 20916.618 14641.09 2462186
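One caveat for the scale mentioned in the question: st_distance() as used above builds the full voters-by-boxes matrix, which may not fit in memory for millions of rows. Here is a hedged sketch of a lighter-weight alternative, using sf's st_nearest_feature() to look up only the index of the closest box for each voter and then computing just those distances element-wise (assuming the geo_voters and geo_dropoff_boxes objects created above):
# index of the nearest drop box for each voter (one integer per voter)
nearest_idx <- st_nearest_feature(geo_voters, geo_dropoff_boxes)
# element-wise distances: voter i to its nearest box only,
# instead of the full voters x boxes matrix
nearest_dist <- st_distance(geo_voters,
                            geo_dropoff_boxes[nearest_idx, ],
                            by_element = TRUE)
geo_voters$dist_km <- as.numeric(nearest_dist) / 1000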


Can I bound an st_distance call by a polygon?

I have seen similar posts on this topic (see, for example, here and here) but not one that is specific to the sf-tidyverse ecosystem.
I have a series of lakes, a series of sample points within each lake, and a "focal point" in each lake that represents where a boat launch is.
I want to calculate the "boater's shortest travel distance" to each sample point from the boat launch. However, I want to somehow "bound" these distances using the lake polygon such that distances cannot be calculated across land. I could imagine this being done by having the "straight line" snake along the lake edge for as long as needed before it can resume being a straight line. I could also imagine the "straight line" being decomposed into line segments that bend around any intervening land. I'm open to a variety of implementations!
I have seen elsewhere (such as here) the idea of converting the bounding polygon to a raster, making the water one value and the land another, much higher value, and then doing a "least-cost distance," where the cost of going over land is prohibitive. However, I don't know how one would actually do this in the raster/sf ecosystem.
Here's the code I used to make this map:
library(dplyr)
library(sf)
library(ggplot2)
Moose.ssw = sswMN.sf %>% filter(lake == "Moose")
Moose.lake = MN_lakes4 %>% filter(str_detect(map_label, "Moose")) %>% filter(cty_name == "Beltrami")
Moose.access = try3 %>% filter(LAKE_NAME == "Moose") %>% filter(COUNTYNAME == "Beltrami")
Moose.box = st_bbox(Moose.ssw)
ggplot() +
  geom_sf(data = Moose.lake, color = "lightblue") +
  geom_sf(data = Moose.access, color = "red", size = 2) +
  geom_sf(data = Moose.ssw, mapping = aes(color = Nitellopsis_obtusa_n)) +
  coord_sf(xlim = c(Moose.box[1], Moose.box[3]), ylim = c(Moose.box[2], Moose.box[4]))
And here's some code that calculates straight-line distances splendidly; maybe it can be wrapped somehow?
Moose.pt.dists = st_distance(Moose.access, Moose.ssw, by_element = TRUE)
Files needed to make the data objects referenced above can be pulled from my Github page (they are files produced via dput()). Link to my Github.
I am a solid R programmer but I am new to geospatial work, so if I could even just be pointed in a fruitful direction, I may be able to program my own way out of this!
Here's a solution using sfnetworks, which fits in with the tidyverse well.
The code below should:
regularly sample the bounding box of the lake (creating evenly-spaced points),
connect each point to its closest neighbors,
remove the connections that cross land, and
find the shortest path from the boat launch to the sample location(s) by following the lines that remain.
The results are not exact, but are pretty close. You can increase precision by increasing the number of sampled points. 1000 points are used below.
library(tidyverse)
library(sf)
library(sfnetworks)
library(nngeo)
# set seed to make the script reproducible,
# since there is random sampling used
set.seed(813)
# Getting your data:
x <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
moose_lake <- x[5,]
boat_ramp <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
sample_locations <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
sample_bbox <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
# sample the bounding box with regular square points, then connect each point to the closest 9 points
# 8 should've worked, but left some diagonals out.
sq_grid_sample <- st_sample(st_as_sfc(st_bbox(moose_lake)), size = 1000, type = 'regular') %>%
  st_as_sf() %>%
  st_connect(., ., k = 9)
# remove connections that are not within the lake polygon
sq_grid_cropped <- sq_grid_sample[st_within(sq_grid_sample, moose_lake, sparse = F),]
# make an sfnetwork of the cropped grid
lake_network <- sq_grid_cropped %>% as_sfnetwork()
# find the (approximate) distance from boat ramp to point 170 (far north)
pt170 <- st_network_paths(lake_network,
                          from = boat_ramp,
                          to = sample_locations[170, ]) %>%
  pull(edge_paths) %>%
  unlist()
lake_network %>%
  activate(edges) %>%
  slice(pt170) %>%
  st_as_sf() %>%
  st_combine() %>%
  st_length()
#> 2186.394 [m]
Looks like it is about 2186m from the boat launch to sample location 170 at the far north end of the lake.
# Plotting all of the above, path from boat ramp to point 170:
ggplot() +
  geom_sf(data = sq_grid_sample, alpha = .05) +
  geom_sf(data = sq_grid_cropped, color = 'dodgerblue') +
  geom_sf(data = moose_lake, color = 'blue', fill = NA) +
  geom_sf(data = boat_ramp, color = 'springgreen', size = 4) +
  geom_sf(data = sample_locations[170, ], color = 'deeppink1', size = 4) +
  geom_sf(data = lake_network %>%
            activate(edges) %>%
            slice(pt170) %>%
            st_as_sf(),
          color = 'turquoise',
          size = 2) +
  theme_void()
Though only the distance from the boat launch to one sample point is found and plotted above, the network is there to find all of the distances. You may need to use *apply or purrr, and maybe a small custom function to find the 'one-to-many' distances of the launch to all sample points.
This page on sfnetworks will be helpful in writing the one-to-many solution.
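As a hedged sketch of that one-to-many idea with purrr: edge_path_length() below is a hypothetical helper that just wraps the path-length code shown above, not a function from sfnetworks.
# hypothetical helper: network path length (in meters) from the launch to sample point i
edge_path_length <- function(i) {
  edge_ids <- st_network_paths(lake_network,
                               from = boat_ramp,
                               to = sample_locations[i, ]) %>%
    pull(edge_paths) %>%
    unlist()
  lake_network %>%
    activate(edges) %>%
    slice(edge_ids) %>%
    st_as_sf() %>%
    st_combine() %>%
    st_length() %>%
    as.numeric()
}
# one distance per sample point
all_dists <- purrr::map_dbl(seq_len(nrow(sample_locations)), edge_path_length)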
edit:
To find all distances from the boat launch to the sample points:
st_network_cost(lake_network,
                from = boat_ramp,
                to = sample_locations)
This is much faster for me than a for loop or the sp solution posted below. Some code in the sampling may need to be adjusted, since st_network_cost() will remove any duplicate distances. The sample_locations (or Moose.ssw in @bajcz's answer) will need to be cleaned to remove duplicate points as well; there are 180 unique points among the 322 rows.
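A minimal sketch of that clean-up step, assuming the duplicates are exact repeats of the same point geometry:
# keep one row per unique point geometry before building the cost matrix
sample_unique <- sample_locations[!duplicated(st_geometry(sample_locations)), ]
st_network_cost(lake_network,
                from = boat_ramp,
                to = sample_unique)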
I was planning to answer my own question today, as I had just managed to get the gdistance and sp method linked in my question working, but mrhellmann beat me to it! Since their answer also works and is quite elegant, I thought I'd post code here that uses both approaches and compares the outcomes, for those who are interested (and in case one or the other doesn't work in your context). They are about equally fast, although that's with a for() loop in mrhellmann's sfnetworks version, so theirs might be faster if that bit can be vectorized, which I'm sure is possible.
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
This is "my" way using sp and gdistance tools:
ptm <- proc.time()
#Make all the objects needed into spatial class objects.
test.pts = as(Moose.access, Class = "Spatial")
ssw.pts = as(Moose.ssw, Class = "Spatial")
test.bounds = as(Moose.lake, Class = "Spatial")
#Turn the lake into a raster of resolution 1000 x 1000 (which is arbitrary) and then make all points not in the lake = 0 so that no distances can "afford" to cross over land.
test.raster = raster(extent(test.bounds), nrow=1000, ncol=1000)
lake.raster = rasterize(test.bounds, test.raster)
lake.raster[is.na(lake.raster)] = 0
##For some lakes, although not this one, the public access was just off the lake, which resulted in distances of Inf. This code puts a circular buffer around the dock locations to prevent this. It makes a new raster with 1s at the dock location(s), finds all indices of cells within the buffer distance of each dock (here, 300, which is overkill), and makes the corresponding cells in the lake raster also 1 so that they are considered navigable. This makes the distances slightly inaccurate because it may allow some points to be more easily reachable than they should be, but it is better than the alternative!
access.raster = rasterize(test.pts, lake.raster, field = 1)
index.spots = raster::extract(access.raster, test.pts, buffer=300, cellnumbers = T)
index.spots2 = unlist(lapply(index.spots, "[", , 1))
lake.raster[index.spots2] = 1
#Make a transition matrix so that the cost of moving from one cell to the next can be understood. TBH, I don't understand this part well beyond that.
transition.mat1 = transition(lake.raster, transitionFunction=mean, directions=16) #This code does take a little while.
transition.mat1 = geoCorrection(transition.mat1,type="c")
#Calculates the actual cost-based distances.
dists.test = costDistance(transition.mat1, test.pts, ssw.pts)
proc.time() - ptm #About 55 seconds on my laptop.
And then for comparison, here's the sfnetworks way as provided by mrhellmann.
ptm <- proc.time()
sq_grid_sample = st_sample(st_as_sfc(st_bbox(Moose.lake)), size = 1000, type = 'regular') %>%
  st_as_sf() %>%
  st_connect(., ., k = 9)
sq_grid_cropped = sq_grid_sample[st_within(sq_grid_sample, Moose.lake, sparse = F)] #I needed to delete a comma in this line to get it to work--I don't think we needed row,column indexing here.
lake_network = sq_grid_cropped %>% as_sfnetwork()
##Using their code to generate all the distances between the dock and the sample points. This might be vectorizable.
dists.test2 = rep(0, nrow(Moose.ssw))
for (i in 1:nrow(Moose.ssw)) {
  dist.pt = st_network_paths(lake_network,
                             from = Moose.access,
                             to = Moose.ssw[i, ]) %>%
    pull(edge_paths) %>%
    unlist() # This produces the edges we must pass through, I think
  dists.test2[i] = lake_network %>%
    activate(edges) %>%
    slice(dist.pt) %>%
    st_as_sf() %>%
    st_combine() %>%
    st_length()
}
proc.time() - ptm #About 58 seconds on my laptop.
Here are two graphs that I think show that both approaches produce relatively similar numbers without much sign of bias, although it does look like some distances ended up a little longer with the sfnetworks approach. The first is a pair of overlapped density plots of the distance estimates from the two approaches; the second is a 1-to-1 scatterplot of the distances plotted against each other, where you can see the fit to a 1-to-1 line would be pretty good.
hist.df = data.frame(dists = c(dists.test, dists.test2), version = rep(c("sp", "sfnetworks"), each = 322))
gf_density(~dists, fill=~version, data=hist.df )
gf_point(dists.test2~dists.test)
I know this question has been answered, but I thought I'd throw an alternative method in here anyway. And it looks like (for me at least) this approach is a little faster than the others. This uses terra, which is the "replacement" for the raster package; many functions have the same names and do the same jobs, but the main difference is that terra uses SpatRaster objects, whereas raster uses Raster objects.
# import data ---
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
The first thing we do is make the lake into a raster, which terra calls a SpatRaster. We use the extent (terra::ext()) of Moose.lake, 1000 rows and 1000 cols, and set the crs to the same as Moose.lake. Initially, we give every cell a value of 1, but then we can use terra::mask() to give every cell outside Moose.lake a value of 2 (these will be the "high cost" or "no go zone" cells).
# step 1 - rasterize() the lake ---
# find the spatial extent of the lake
ext(Moose.lake) %>%
  # make a raster with the same extent, and assign all cells a value of 1
  rast(nrow = 1000, ncol = 1000, crs = st_crs(Moose.lake)[1], vals = 1) %>%
  # set all cells outside the lake to a value of 2
  mask(vect(Moose.lake), updatevalue = 2) %>%
  {. ->> moose_rast}
moose_rast
# class : SpatRaster
# dimensions : 1000, 1000, 1 (nrow, ncol, nlyr)
# resolution : 2.899889, 3.239947 (x, y)
# extent : 388474, 391373.9, 5265182, 5268422 (xmin, xmax, ymin, ymax)
# coord. ref. : +proj=utm +zone=15 +datum=NAD83 +units=m +no_defs
# source : memory
# name : lyr.1
# min value : 1
# max value : 2
plot(moose_rast)
Then, we give the cell where Moose.access is a value of 3 (or any value other than 1 or 2) - this will be our starting point.
# now, make the boat ramp a value of 3 (this is the starting point)
# find the cell number based on the coordinates of Moose.access
moose_rast %>%
  cellFromXY(st_coordinates(Moose.access)) %>%
  {. ->> access_cell}
access_cell
# [1] 561618
# assign that cell a value of 3 (or any number you want other than 1 or 2)
values(moose_rast)[access_cell] <- 3
# check it
moose_rast %>% plot
Although you can't really see it on the plot because the cell is so small, we can tell that the cell value has changed because our legend now includes a value of 3.
Next up, we use terra::gridDistance() to find the distance from the starting cell (value of 3), to every other cell. We specify origin = 3 to assign this cell as the starting point, and we tell it not to traverse any cells with value of 2 using omit = 2. This function returns a SpatRaster, but this time the cell values are the distances to the origin.
# now, find distances from the access_cell to every other cell
moose_rast %>%
  gridDistance(origin = 3, omit = 2) %>%
  {. ->> moose_rast_distances}
moose_rast_distances
# class : SpatRaster
# dimensions : 1000, 1000, 1 (nrow, ncol, nlyr)
# resolution : 2.899889, 3.239947 (x, y)
# extent : 388474, 391373.9, 5265182, 5268422 (xmin, xmax, ymin, ymax)
# coord. ref. : +proj=utm +zone=15 +datum=NAD83 +units=m +no_defs
# source : memory
# name : lyr.1
# min value : 0
# max value : 2424.096
# check it
moose_rast_distances %>% plot
In this plot, the areas closest to the access cell are white, and those farthest away are green.
So now all we have to do is find the cell numbers of the sample sites within the lake and pull out their cell values. We use terra::cellFromXY() to find the cell numbers, based on a set of XY coordinates.
# now, find the cell numbers of all the study sites
moose_rast %>%
  cellFromXY(st_coordinates(Moose.ssw)) %>%
  {. ->> site_cells}
# cell numbers
site_cells %>%
head(50)
# [1] 953207 953233 930156 930182 930207 930233 907156 907182 907207 884078 884130
# [12] 884156 884182 884207 861052 861078 861104 861130 861156 861182 861207 837026
# [23] 837052 837078 837104 837130 837156 837182 837207 814026 814052 814078 814104
# [34] 814130 814156 814182 814207 791026 791104 791130 791156 791182 768182 745182
# [45] 745207 722026 722233 699259 675052 675285
# and now their distance values
values(moose_rast_distances)[site_cells] %>% head(50)
# [1] 2212.998 2241.812 2144.020 2115.206 2138.479 2167.293 2069.501 2040.688 2063.960
# [10] 2081.424 2023.796 1994.983 1966.169 1989.441 2078.719 2006.905 1978.092 1949.278
# [19] 1920.464 1891.650 1914.923 2119.358 2043.960 1968.563 1900.333 1871.519 1842.705
# [28] 1813.891 1837.164 2086.047 2010.650 1935.253 1859.856 1797.000 1768.186 1739.372
# [37] 1762.645 2052.736 1826.545 1751.148 1693.667 1664.854 1590.335 1533.733 1488.110
# [46] 1952.805 1384.778 1281.445 1809.338 1174.872
And to make the distances a little more user friendly, we can instead put them into a new column in the sf collection.
# now, make a new column in the study sites which is the distance to the access_cell
Moose.ssw %>%
  rowwise %>%
  mutate(
    distance_to_access = cellFromXY(moose_rast, st_coordinates(geometry)) %>% values(moose_rast_distances)[.]
  ) %>%
  select(distance_to_access, everything())
# Simple feature collection with 322 features and 122 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 388549 ymin: 5265332 xmax: 391324 ymax: 5268332
# CRS: +proj=utm +zone=15 +datum=NAD83 +unit=m
# # A tibble: 322 x 123
# # Rowwise:
# distance_to_access lake county dow lake_size_acres contact year_discovered first_year_trea~ survey_year
# <dbl> <chr> <chr> <int> <dbl> <chr> <int> <chr> <int>
# 1 2213. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 2 2242. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 3 2144. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 4 2115. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 5 2138. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 6 2167. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 7 2070. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 8 2041. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 9 2064. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# 10 2081. Moose Beltrami 4001100 601. Nicole Kovar 2016 NoTreat 2017
# # ... with 312 more rows, and 114 more variables: survey_dates <chr>, surveyor <chr>, pointid <int>, depth_ft <dbl>,
# # Myriophyllum_spicatum_or_hybrid_n <int>, Potamogeton_crispus_n <int>, Bidens_beckii_n <int>,
# # Brasenia_schreberi_n <int>, Ceratophyllum_demersum_n <int>, Ceratophyllum_echinatum_n <int>, Chara_sp_n <int>,
# # Eleocharis_acicularis_n <int>, Eleocharis_palustris_n <int>, Elodea_canadensis_n <int>,
# # Elodea_nuttallii_n <int>, Heteranthera_dubia_n <int>, Lemna_minor_n <int>, Lemna_trisulca_n <int>,
# # Myriophyllum_heterophyllum_n <int>, Myriophyllum_sibiricum_n <int>, Myriophyllum_verticillatum_n <int>,
# # Najas_flexilis_n <int>, Najas_gracillima_n <int>, Najas_guadalupensis_n <int>, Najas_marina_n <int>, ...
The distances are in metres because the CRS is UTM.
Additionally, gridDistance() could also be used to find the shortest distance from every point within the lake to the lake's edge. To do this we say origin = 2, which, rather than just a single point like our boat ramp earlier, is every cell on land.
moose_rast %>%
  gridDistance(origin = 2) %>%
  plot
So I ran the two approaches in your answer above, and compared the results and times to this method. Time-wise, the terra approach was much faster, on my computer anyway:
sp: ~230 secs
sfnetworks: ~83 secs
terra: ~5 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# sp
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
ptm <- proc.time()
#Make all the objects needed into spatial class objects.
test.pts = as(Moose.access, Class = "Spatial")
ssw.pts = as(Moose.ssw, Class = "Spatial")
test.bounds = as(Moose.lake, Class = "Spatial")
#Turn the lake into a raster of resolution 1000 x 1000 (which is arbitrary) and then make all points not in the lake = 0 so that no distances can "afford" to cross over land.
test.raster = raster(extent(test.bounds), nrow=1000, ncol=1000)
lake.raster = rasterize(test.bounds, test.raster)
lake.raster[is.na(lake.raster)] = 0
##For some lakes, although not this one, the public access was just off the lake, which resulted in distances of Inf. This code puts a circular buffer around the dock locations to prevent this. It makes a new raster with 1s at the dock location(s), finds all indices of cells within the buffer distance of each dock (here, 300, which is overkill), and makes the corresponding cells in the lake raster also 1 so that they are considered navigable. This makes the distances slightly inaccurate because it may allow some points to be more easily reachable than they should be, but it is better than the alternative!
access.raster = rasterize(test.pts, lake.raster, field = 1)
index.spots = raster::extract(access.raster, test.pts, buffer=300, cellnumbers = T)
index.spots2 = unlist(lapply(index.spots, "[", , 1))
lake.raster[index.spots2] = 1
#Make a transition matrix so that the cost of moving from one cell to the next can be understood. TBH, I don't understand this part well beyond that.
transition.mat1 = transition(lake.raster, transitionFunction=mean, directions=16) #This code does take a little while.
transition.mat1 = geoCorrection(transition.mat1,type="c")
#Calculates the actual cost-based distances.
dists.test = costDistance(transition.mat1, test.pts, ssw.pts)
proc.time() - ptm
# 234 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# sfnetworks
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(sf)
library(sp)
library(raster)
library(gdistance)
library(tidyverse)
library(sfnetworks)
library(nngeo)
library(mosaic)
ptm <- proc.time()
sq_grid_sample = st_sample(st_as_sfc(st_bbox(Moose.lake)), size = 1000, type = 'regular') %>%
  st_as_sf() %>%
  st_connect(., ., k = 9)
sq_grid_cropped = sq_grid_sample[st_within(sq_grid_sample, Moose.lake, sparse = F)] #I needed to delete a comma in this line to get it to work--I don't think we needed row,column indexing here.
lake_network = sq_grid_cropped %>% as_sfnetwork()
##Using their code to generate all the distances between the dock and the sample points. This might be vectorizable.
dists.test2 = rep(0, nrow(Moose.ssw))
for (i in 1:nrow(Moose.ssw)) {
  dist.pt = st_network_paths(lake_network,
                             from = Moose.access,
                             to = Moose.ssw[i, ]) %>%
    pull(edge_paths) %>%
    unlist() # This produces the edges we must pass through, I think
  dists.test2[i] = lake_network %>%
    activate(edges) %>%
    slice(dist.pt) %>%
    st_as_sf() %>%
    st_combine() %>%
    st_length()
}
proc.time() - ptm
# 83 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# terra
#All the hooplah needed to get things started and talking the same language.
x = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.lake")
# Subset to get just one lake
Moose.lake = x[5,]
Moose.access = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.access")
Moose.ssw = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.ssw")
Moose.box = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/Moose.box")
library(tidyverse)
library(sf)
library(terra)
ptm <- proc.time()
# make a raster of Moose Lake
ext(Moose.lake) %>%
  # make a raster with the same extent, and assign all cells a value of 1
  rast(nrow = 1000, ncol = 1000, crs = st_crs(Moose.lake)[1], vals = 1) %>%
  # set all cells outside the lake to a value of 2
  mask(vect(Moose.lake), updatevalue = 2) %>%
  {. ->> moose_rast}
# find access cell
moose_rast %>%
  cellFromXY(st_coordinates(Moose.access)) %>%
  {. ->> access_cell}
# assign that cell a value of 3 (or any number you want other than 1 or 2)
values(moose_rast)[access_cell] <- 3
# find distances to every other cell
moose_rast %>%
  gridDistance(origin = 3, omit = 2) %>%
  {. ->> moose_rast_distances}
# make column with distances to each sample site
distances <- Moose.ssw %>%
  rowwise %>%
  mutate(
    distance_to_access = cellFromXY(moose_rast, st_coordinates(geometry)) %>% values(moose_rast_distances)[.]
  ) %>%
  select(distance_to_access, everything())
proc.time() - ptm
# 5 secs
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hist.df2 = data.frame(dists = c(dists.test, dists.test2, distances$distance_to_access),
                      version = rep(c("sp", "sfnetworks", "terra"), each = 322))
gf_density(~dists, fill = ~version, data = hist.df2)
And the distance values from the terra method are pretty similar to those from the sp method.
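A quick 1-to-1 check along the lines of the earlier scatterplot should also work here (a sketch, assuming the objects from the timing runs above, and the mosaic package, are still loaded):
# terra distances against sp distances; points near a 1:1 line indicate agreement
gf_point(distances$distance_to_access ~ dists.test)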

How to get the neighboring counties of all counties based on the content of the file?

I have a shapefile. I want to get the names of the neighboring counties for every region, based on the latitude and longitude in the file. I found that some regions obviously have neighboring counties, but I didn't get those neighbors when I ran the code. I don't know what went wrong.
library(tidyverse)
library(plyr)
library(sf)
library(readxl)
county <- st_read('D:/county.shp', stringsAsFactors = FALSE)

neighbor_counties <- function(subcounty) {
  name <- st_touches(subcounty, county)
  county[unlist(name), ]$NAME
}

output <- vector("list", nrow(county))
names(output) <- county$NAME
for (i in seq_len(nrow(county))) {
  output[[i]] <- suppressWarnings(neighbor_counties(county[i, ]))
}
output
head(output)

neighbor <- output %>%
  ldply(data.frame) %>%
  set_names("ori_county", "neighbor_county")
Your example is not exactly reproducible, but we are lucky to have the nc.shp shapefile that ships with {sf} available.
So consider this code; it is built on the sf::st_touches() function, with the county shapefile passed as an argument twice (once for the touching counties, and once for the counties being touched). sparse = TRUE makes it return a list of indices of neighboring counties.
To find the names of the neighbors of a particular county, you need to know the index of the county of interest and then subset the list of neighbors accordingly; this gives you the indices of the neighboring counties.
As for the second part of your question (expressed in comments), i.e. how to get from a list of indices to a data frame of neighbors: I suggest creating a function that returns a data frame, and then applying it via purrr::map_dfr() to the vector of indices as starting points; consider the code provided and amend as necessary. It should give you a start...
library(sf)
shape <- st_read(system.file("shape/nc.shp", package="sf")) # included with sf package
# a list of neighbors
neighbors <- st_touches(shape, # first
                        shape, # second
                        sparse = TRUE)
# neighbors of County Mecklenburg (as in Charlotte of Mecklenburg-Strelitz)
# index of Mecklenburg cnty
idx_strelitz <- which(shape$NAME == 'Mecklenburg')
# index of neighbors of Mecklenburg cnty
nbr_mecklenburg <- neighbors[idx_strelitz][[1]]
# names of neighbours of cnty Meckl.
shape$NAME[nbr_mecklenburg]
# [1] "Iredell" "Lincoln" "Cabarrus" "Gaston" "Union"
# a visual check
plot(st_geometry(shape))
plot(shape[idx_strelitz, ], col = "blue", add = T)
plot(shape[nbr_mecklenburg,], col = "red", add = T)
# second question: get pairs of names as a data frame
# a function returning data frame of neighbors of a given cnty
nbr_pairs <- function(idx) {
  data.frame(ori_county = rep(shape$NAME[idx], length(neighbors[[idx]])),
             neighbor_county = shape$NAME[neighbors[[idx]]])
}
# check - cnty Mecklemburg
nbr_pairs(idx_strelitz)
# ori_county neighbor_county
# 1 Mecklenburg Iredell
# 2 Mecklenburg Lincoln
# 3 Mecklenburg Cabarrus
# 4 Mecklenburg Gaston
# 5 Mecklenburg Union
# apply to list of indices
pairs_of_names <- purrr::map_dfr(seq_along(neighbors),
nbr_pairs)

rolling join in data.table to multiple (GPS) points

Last week I asked this question: How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table. It is about matching the GPS coordinates of a GPS tracker with the GPS coordinates of bus stops. User Hugh gave an answer that works many times faster than the solution I eventually came up with. However, there is one problem I cannot fix in his/her code (I feel like I really tried every possible iteration of the code): I need a rolling join in which multiple bus stops that are near a GPS point get listed. In the current code only the nearest one is given, but a GPS coordinate can of course be near multiple bus stops. I know this means I will have to swap the Y and the X in the rolling join somewhere, but somehow it just doesn't work. Can someone maybe point me towards a solution?
All the code and full explanation can be found in the original question, but this is how I create the data:
# create GPS data
number_of_GPS_coordinates <- 1000000
set.seed(1)
gpsdata <- data.frame(id = 1:number_of_GPS_coordinates,
                      device_id = 1,
                      latitude = runif(number_of_GPS_coordinates, 50.5, 53.5),
                      longitude = runif(number_of_GPS_coordinates, 4, 7))

# create bus stop data
set.seed(1)
number_of_bus_stops <- 55000
busdata <- data.frame(id = 1:number_of_bus_stops,
                      name = replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace = TRUE), collapse = "")),
                      latitude = runif(number_of_bus_stops, 50.5, 53.5),
                      longitude = runif(number_of_bus_stops, 4, 7))
And this is the function that does work for me, though it is still not very fast (but much faster than my first attempt):
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
                           dataset2 = bus_stops,
                           n.splits = 500,
                           desired.dist = .2) {
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters:
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# looking at the Netherlands as a whole, this means we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are only looking at a single trip, for example), we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check features that already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side: maybe a feature falls just outside the GPS coordinates,
# but is still near a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
& longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right amount: if you add too many, you get a large and redundant amount of empty values;
# if you add too few, you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in our dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a data point is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist = desired.dist) {
  distances <- spDistsN1(data.matrix(filter(dataset1, latitude_range == h)[, c("longitude_gps", "latitude_gps")]),
                         p,
                         longlat = TRUE) # in km
  return(which(distances <= dist)) # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1 that restarts at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the splits and, for every point, check if there is a feature nearby in the slice it falls in.
# to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
# are nearby but just fall in a different slice.
# 9a: create an empty list we fill with data frames later
TT <- vector("list", length = n.splits)
# 9b: loop over the number of slices using the function defined above
for (i in 1:n.splits) {
  datapoints.near.feature <- apply(data.matrix(dataset2[range2 %in% c(i-1, i, i+1), c("longitude_feature", "latitude_feature")]), 1, checkdatapoint, h = i)
  # 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we create an empty list entry
  if (class(datapoints.near.feature) == "integer" | class(datapoints.near.feature) == "matrix") {
    TT[[i]] <- NULL
  } else {
    # 9d: if there was a match we get a named list of data points
    names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1, i, i+1), name]
    # 9e: then we 'melt' this list into a data.frame
    temp <- melt(datapoints.near.feature)
    # 9f: then we transform it into a data.table and change the names
    setDT(temp)
    setnames(temp, old = c("value", "L1"), new = c("value", "feature_name"))
    # 9h: then we only select the data points in dataset1 that fall in the current slice, give them an
    # ID, and merge them with the file of nearby bus stops
    gpsdata.f <- dataset1[latitude_range == i, ]
    gpsdata.f[, rowID2 := seq_len(.N)]
    setkey(gpsdata.f, key = "rowID2")
    setkey(temp, key = "value")
    GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x = "rowID2", by.y = "value", all.x = TRUE)
    # 9i: we only keep the unique entries and save them to the list for every slice
    GPS.joined.unique.temp <- unique(GPS.joined.temp, by = c("id_dataset1", "feature_name"))
    TT[[i]] <- GPS.joined.unique.temp
    cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
    #cat(i/n.splits*100, " \r"); flush.console()
  }
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist <- merge(x = dataset1,
                   y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
                   by.x = "id_dataset1",
                   by.y = "id_dataset1",
                   all.x = TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once, these are listed in a vector
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name = list(feature_name), id = id_dataset1, lat = latitude_gps.x, lon = longitude_gps.x, nearby = nearby), by = id_dataset1], by = "id_dataset1")
return(finallist)
}
This is the code @Hugh proposed, which is super fast and almost does the same thing, except that it returns only the nearest bus stop for each point rather than all nearby ones.
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata <- as.data.frame(cbind(id = 1:number_of_GPS_coordinates,
                               lat = runif(number_of_GPS_coordinates, 50.5, 53.5),
                               lon = runif(number_of_GPS_coordinates, 4, 7)))

# create some bus stop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop <- as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep the id variable
stop$lat <- stop$lat + rnorm(number_of_bus_stops, 0, .0005)
stop$lon <- stop$lon + rnorm(number_of_bus_stops, 0, .0005)
busdata.data <- cbind(stop, name = replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace = TRUE), collapse = "")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
  gpsdata[, .(id), keyby = "lat"]

By_latitude <-
  busdata.data[gpsdata,
               on = "lat",
               # within 0.5 degrees of latitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch = 0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),
               # within 0.5 degrees of longitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch = 0L]

By_latlon[, distance := haversine_distance(lat1 = gps_lat,
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]

By_latlon[distance < 0.2]
Can someone help me adjust the code in such a way that not only the nearest bus stop but all stops within 200 meters are given (in a list, or in separate rows)?

How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table

I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:
1) How to efficiently calculate distance between pair of coordinates using data.table :=
2) Using a simple for loop on spatial data
3) Calculate distance between two latitude-longitude points? (Haversine formula)
4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]
I was able to construct code that works, but it is (too) slow. I was wondering if someone can help me with a faster data.table implementation, or can point out where the bottleneck in my code is. Is it the spDistsN1() function, or maybe the apply() and melt() combination? I am most comfortable in R, but open to other software (as long as it is open source).
Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.
# packages:
library(data.table)
library(tidyverse)
library(sp)
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata <- as.data.frame(cbind(id = 1:number_of_GPS_coordinates,
                               lat = runif(number_of_GPS_coordinates, 50.5, 53.5),
                               lon = runif(number_of_GPS_coordinates, 4, 7)))

# create some bus stop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop <- as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep the id variable
stop$lat <- stop$lat + rnorm(number_of_bus_stops, 0, .0005)
stop$lon <- stop$lon + rnorm(number_of_bus_stops, 0, .0005)
busdata.data <- cbind(stop, name = replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace = TRUE), collapse = "")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
Download the real bus stop data if you want; it is kind of hard to reproduce a random sample of this.
#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
# mutate(latitude_bustops = latitude)%>%
# mutate(longitude_bustops = longitude)%>%
# dplyr::select(name, latitude_bustops, longitude_bustops)
Code I use now to calculate distances. It works, but it is pretty slow:
countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[, c("lon", "lat")]),
                         p,
                         longlat = TRUE) # in km
  return(which(distances <= .2)) # distance is now set to 200 meters
}
# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)
# rename list entries
names(datapoints.by.bustation) <- busdata.data$name
# melt list into one big data.frame
long.data.frame.busstops <- melt(datapoints.by.bustation)
# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, key = "rowID")
setDT(long.data.frame.busstops)
# merge the data, and filter non-unique entries
setkey(long.data.frame.busstops, key = "value")
GPS.joined <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by = "id")
# this last part of the code is needed to make sure that if there is more than 1 bus stop nearby, these bus stops are put in a list
# instead of adding rows and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# makes sense:
as.tibble(GPS.joined.unique2) %>%
summarize(sum = sum(nearby))
Consider cutting using a slicing method: first cut by close latitudes and close longitudes; in this case 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support for rolling joins.
The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
  gpsdata[, .(id), keyby = "lat"]

By_latitude <-
  busdata.data[gpsdata,
               on = "lat",
               # within 0.5 degrees of latitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch = 0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),
               # within 0.5 degrees of longitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch = 0L]

By_latlon[, distance := haversine_distance(lat1 = gps_lat,
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]

By_latlon[distance < 0.2]
This is the function I came up with so far. @Dave2e, thanks: it is already an awful lot faster than what I had. There is still clearly room for improvement, but as it stands it is fast enough for my analysis now. I only slice by latitude and not longitude; the only reason for that is that it makes indexing, and then looping over indices, really easy, but more speed could be gained by also indexing on longitude. Also, real GPS data tend to contain many duplicate values (same lon/lat, different time stamp), so the code would be more efficient if it took this into account; a rough sketch of that idea is below, and maybe I will work on it in the future.
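As a rough, hypothetical sketch of that duplicate-handling idea (not implemented in the function below): collapse repeated coordinate pairs first, run the check on the unique pairs only, and join the results back afterwards.
# sketch: one row per unique lon/lat pair (column names as in the reproducible gpsdata above)
setDT(gpsdata)
unique_gps <- unique(gpsdata, by = c("lat", "lon"))
# ... run check_if_close() on unique_gps, then merge the results
# back onto the full data by lat/lon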
# this app could be much faster if it would filter by duplicate GPS coordinates
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
                           dataset2 = bus_stops,
                           n.splits = 500,
                           desired.dist = .2) {
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters:
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# looking at the Netherlands as a whole, this means we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are only looking at a single trip, for example), we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check features that already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side: maybe a feature falls just outside the GPS coordinates,
# but is still near a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
& longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right amount: if you add too many, you get a large and redundant amount of empty values;
# if you add too few, you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in our dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a data point is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist = desired.dist) {
  distances <- spDistsN1(data.matrix(filter(dataset1, latitude_range == h)[, c("longitude_gps", "latitude_gps")]),
                         p,
                         longlat = TRUE) # in km
  return(which(distances <= dist)) # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1 that restarts at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the splits and, for every point, check if there is a feature nearby in the slice it falls in.
# to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
# are nearby but just fall in a different slice.
# 9a: create an empty list we fill with data frames later
TT <- vector("list", length = n.splits)
# 9b: loop over the number of slices using the function defined above
for (i in 1:n.splits) {
  datapoints.near.feature <- apply(data.matrix(dataset2[range2 %in% c(i-1, i, i+1), c("longitude_feature", "latitude_feature")]), 1, checkdatapoint, h = i)
  # 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we create an empty list entry
  if (class(datapoints.near.feature) == "integer" | class(datapoints.near.feature) == "matrix") {
    TT[[i]] <- NULL
  } else {
    # 9d: if there was a match we get a named list of data points
    names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1, i, i+1), name]
    # 9e: then we 'melt' this list into a data.frame
    temp <- melt(datapoints.near.feature)
    # 9f: then we transform it into a data.table and change the names
    setDT(temp)
    setnames(temp, old = c("value", "L1"), new = c("value", "feature_name"))
    # 9h: then we only select the data points in dataset1 that fall in the current slice, give them an
    # ID, and merge them with the file of nearby bus stops
    gpsdata.f <- dataset1[latitude_range == i, ]
    gpsdata.f[, rowID2 := seq_len(.N)]
    setkey(gpsdata.f, key = "rowID2")
    setkey(temp, key = "value")
    GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x = "rowID2", by.y = "value", all.x = TRUE)
    # 9i: we only keep the unique entries and save them to the list for every slice
    GPS.joined.unique.temp <- unique(GPS.joined.temp, by = c("id_dataset1", "feature_name"))
    TT[[i]] <- GPS.joined.unique.temp
    cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
    #cat(i/n.splits*100, " \r"); flush.console()
  }
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist <- merge(x = dataset1,
                   y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
                   by.x = "id_dataset1",
                   by.y = "id_dataset1",
                   all.x = TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once, these are listed in a vector
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name = list(feature_name), id = id_dataset1, lat = latitude_gps.x, lon = longitude_gps.x, nearby = nearby), by = id_dataset1], by = "id_dataset1")
return(finallist)
}

Dissolve all polygons but one from a shape file

I want to dissolve all polygons but one from a shape file. Is there a way to do this?
Here is a reproducible example:
library(rgeos)
library(UScensus2000tract)
# load data
data("oregon.tract")
# plot map
plot(oregon.tract)
# Dissolve all polygons
d <- gUnaryUnion(oregon.tract, id = oregon.tract@data$state)
plot(d)
In this example, is it possible to dissolve the polygons while keeping tract number 9501?
I assume this is what you are looking for. It would be slightly different if you wanted to merge contiguous members of the tract together: all you would have to do is remove the first element (the entire state) from the polygon, run gUnaryUnion on the remainder, and then re-add the contiguous tract members to a copy of the gUnaryUnion-ized state.
oregon = oregon.tract
names(attributes(oregon.tract))
#[1] "bbox" "proj4string" "polygons" "plotOrder" "data"
#[6] "class"
selected_tract_indices = which(oregon.tract@data$tract == 9501)
oregon <- gUnaryUnion(oregon.tract, id = oregon.tract@data$state)
d = oregon
npolygons = 1
for (selected_tract_index in selected_tract_indices){
  d@polygons[[npolygons + 1]] = oregon.tract@polygons[[selected_tract_index]]
  npolygons = npolygons + 1
  d@plotOrder = c(d@plotOrder, as.integer(npolygons))
}
plot(d)
The output of this operation is a SpatialPolygons object. In case you want to convert it back to a SpatialPolygonsDataFrame, here is a simple way to do it:
# Extract polygon IDs
( did <- sapply(slot(d, "polygons"), function(x) slot(x, "ID")) )
# Create dataframe with correct rownames
( d.df <- data.frame(ID = 1:length(d), row.names = did) )
# Try coercion again and check class
d <- SpatialPolygonsDataFrame(d, d.df)
class(d)
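For completeness, here is a minimal sketch of the same idea with sf instead of rgeos (an assumption-laden sketch: it presumes oregon.tract converts cleanly via st_as_sf() and keeps its tract column; untested against this dataset):
library(sf)

ore <- st_as_sf(oregon.tract)

# dissolve everything except tract 9501 into a single geometry
rest <- st_union(ore[ore$tract != 9501, ])
kept <- st_geometry(ore[ore$tract == 9501, ])

# an sfc holding the dissolved rest-of-state plus the kept tract(s)
d_sf <- c(rest, kept)
plot(d_sf)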
