The goal is to average points together within 10 meters without repeating any points in the averaging, to reduce the point dataframe to the averaged points, and ideally to obtain a smooth flow of points along the routes where the points were collected. Here is an 11-point subset example dataframe from a much larger file (25,000 observations):
library(sf)
library(dplyr) # needed for the pipe and summarise() used below
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
Here is what I've tried:
Many iterations of st_is_within_distance(trav, trav, tolerance) including:
an aggregate method shown here. These don't work because the same points get averaged multiple times.
Got close with filter() and across() by trying to dynamically update a list in lapply(), but it didn't work out in the end.
This is helpful from @jeffreyevans, but doesn't really solve the problem and is a bit outdated.
The spThin package doesn't work because it's made for more specific variables.
I thought about clustering, following this post, but the clusters throw random points and don't actually reduce the dataframe efficiently.
Here is as close as I've gotten. Again, the issue with this solution is it repeats points in collecting averages, which gives more weight to certain points than others.
# first set tolerance
tolerance <- 20 # 20 meters
# get distance between points
i <- st_is_within_distance(df, df, tolerance)
# filter for indices with more than 1 (self) neighbor
i <- i[which(lengths(i) > 1)]
# filter for unique indices (point 1, 2 / point 2, 1)
i <- i[!duplicated(i)]
# points in `sf` object that have no neighbors within tolerance
no_neighbors <- df[!(1:nrow(df) %in% unlist(i)), ]
# iterate over indices of neighboring points
avg_points <- lapply(i, function(b){
df <- df[unlist(b), ]
coords <- st_coordinates(df)
df <- df %>%
st_drop_geometry() %>%
cbind(., coords)
df_sum <- df %>%
summarise(
datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y),
.groups = 'drop') %>%
ungroup()
return(df_sum)
}) %>%
bind_rows() %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
Another answer, using sf::aggregate() and a hexagonal grid to find points that are within a particular distance of each other. A square grid could be used as well. Results will vary somewhat depending on where exactly the grid falls in relation to the points, but no point should be used more than once in determining the mean.
Outline of the steps:
load data, transform to crs 5070 for measurements in meters
get a bounding box of the data
make a grid of hexagons of the bounding box of ~10m diameter each
aggregate points falling in the same hexagon using mean
join to original data
library(sf)
library(tidyverse)
set.seed(22) # might be needed to get same hex grid?
#### your sample data
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs) %>%
st_transform(5070) ### transform to 5070 for a projection in meters
#### end sample data
# Get a bounding box as an sf object to make a grid
bbox <- st_bbox(df) %>% st_as_sfc()
# Make a grid as hexagons with approximately the right size
# area ~86 m^2; side ~5.75 m; long diagonal ~11.5 m
hex_grid <- st_make_grid(bbox, cellsize = 10, square = F) %>% st_as_sf()
# Aggregate mean of the hexagonal grid
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait))
# Assign the mean of the hexagon to points that fall
# within each hexagon
df_agg <- st_join(df, hex_agg)
head(df_agg) # trait.x from df, trait.y from the mean by hexagon
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 121281.6 ymin: 1786179 xmax: 121285.3 ymax: 1786227
#> Projected CRS: NAD83 / Conus Albers
#> trait.x datetime.x trait.y datetime.y
#> 1 91.22 2021-08-06 15:08:43 91.70500 2021-08-06 15:26:00
#> 2 91.22 2021-08-06 15:08:44 91.32667 2021-08-06 15:31:47
#> 3 91.22 2021-08-06 15:08:46 91.22000 2021-08-06 15:08:46
#> 4 91.58 2021-08-06 15:08:47 91.58000 2021-08-06 15:08:47
#> 5 91.47 2021-08-06 15:43:17 91.47000 2021-08-06 15:43:17
#> 6 92.19 2021-08-06 15:43:18 91.70500 2021-08-06 15:26:00
#> geometry
#> 1 POINT (121282.5 1786184)
#> 2 POINT (121283.2 1786194)
#> 3 POINT (121284.6 1786216)
#> 4 POINT (121285.3 1786227)
#> 5 POINT (121281.7 1786179)
#> 6 POINT (121281.6 1786186)
sum(df_agg$trait.x) - sum(df_agg$trait.y) # original trait - aggregate trait should be 0, or near 0
#> [1] 0
ggplot(df_agg) +
  geom_sf(aes(size = trait.x), alpha = .2, color = 'blue') + # Original trait
geom_sf(aes(size = trait.y), alpha = .2, color = 'red') + # New aggregated trait
theme_void()
Sized by trait. Blue points are original, red is the new spatial mean.
## Plot of original points & hex grid used:
ggplot() +
geom_sf(data = df, color = 'red') +
geom_sf(data = hex_grid, fill = NA) +
theme_void()
Plot showing the grouping of the points for the mean. It looks like there were groups of 1, 2, and 3 points per hexagon.
Created on 2022-03-23 by the reprex package (v2.0.1)
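If a square grid is preferred (as mentioned at the top of this answer), only the grid step needs to change; a minimal sketch:
# square grid of ~10 m cells instead of hexagons; the rest of the workflow is unchanged
sq_grid <- st_make_grid(bbox, cellsize = 10, square = TRUE) %>% st_as_sf()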
Edit
Updated to have only one point per hexagon, losing some of the original points
## Edit for one point per hexagon:
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait)) %>%
  rownames_to_column('hex_num') # add hexagon number to group by
# Re-create df_agg so each point carries its hexagon number
df_agg <- st_join(df, hex_agg)
## Guide to join on, has only hexagon number & centroid of contained points
hex_guide <- df_agg %>%
group_by(hex_num) %>%
summarise() %>%
st_centroid()
# The full sf object with only one point per hexagon
# this join isn't the most efficient, but slice(1) removes
# the duplicate data. You could clean df_agg before the join
# to resolve this
final_join <- df_agg %>%
st_drop_geometry() %>%
left_join(hex_guide, by = 'hex_num') %>%
group_by(hex_num) %>%
slice(1) %>%
ungroup() %>%
st_as_sf()
ggplot() +
geom_sf(data = final_join, color = 'red', size = 3) +
geom_sf(data = df, color = 'black', alpha = .5) +
geom_sf(data = hex_grid, color = 'blue', fill = NA)
The plot shows the hexagons, original data points in grey, and new red points at the centroid of grouped original points. Only 1 red point per hexagon.
I'm not sure, but perhaps this is what you are looking for?
You can experiment with the different settings/methods of smoothr::smooth() to get the desired results.
library(tidyverse)
library(igraph)
library(smoothr)
library(mapview) # for viewing purposes only
# get a matrix of points <10 meter apart
m <- st_is_within_distance(df, dist = 10, sparse = FALSE)
# create an igraph from the matrix
g <- graph.adjacency(m, mode="undirected", diag = FALSE)
plot(g)
Points that are within 10 metres of each other?
# pass cluster-number to df object
df$id <- as.vector(components(g)$membership)
# create polylines (only if more than 1 point!)
df.lines <- df %>%
group_by(id) %>%
dplyr::add_tally() %>%
dplyr::filter(n > 1) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("LINESTRING") %>%
# create smooth lines
smoothr::smooth(method = "ksmooth")
#view points and lines
mapview::mapview(list(df, df.lines))
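If a single averaged point per cluster is wanted (rather than smoothed lines), the cluster id computed above can be reused. A minimal sketch, assuming the centroid of each cluster's combined points is an acceptable spatial mean (st_centroid may warn for lon/lat data, depending on whether s2 is enabled):
df.avg <- df %>%
  group_by(id) %>%
  summarise(trait = mean(trait),
            datetime = first(datetime)) %>% # geometries are unioned per group
  st_centroid()                             # one point per cluster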
If I understand your problem correctly, it all boils down to selecting the "right" neighbors, i.e. those within a certain neighborhood which were not used yet. If there is no such neighbor, simply use the point itself (even if it was already used in the averaging for another point).
Here's a solution using purrr::accumulate to first produce the correct indices and then simply use these indices to do the averaging:
library(purrr)
library(dplyr)
idx <- accumulate(i[-1L], function(x, y) {
x$points <- setdiff(y, x$used)
x$used <- union(x$used, y)
x
}, .init = list(used = i[[1L]], points = i[[1L]]))
idx[1:4]
# [[1]]
# [[1]]$used
# [1] 1 2 5 6 7 8 9
#
# [[1]]$points
# [1] 1 2 5 6 7 8 9
#
#
# [[2]]
# [[2]]$used
# [1] 1 2 5 6 7 8 9 10 11
#
# [[2]]$points
# [1] 10 11
#
#
# [[3]]
# [[3]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[3]]$points
# [1] 3 4
#
#
# [[4]]
# [[4]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[4]]$points
# integer(0)
The idea is that we maintain a list of used indices, that is, the ones which were already used in any of the neighborhoods, and the remaining ones (points). For instance, for the first point we use the points at indices 1, 2, 5, 6, 7, 8, 9, which leaves only indices 10, 11 for the second point. If there is no point left, we return integer(0).
Now that we have set up the indices list, the rest is easy: we loop through the list, select the indicated points (using the point itself in case there is no point left) and do the averaging:
idx %>%
imap_dfr(function(x, y) {
if (!length(x$points)) {
idx <- y
} else {
idx <- x$points
}
df[idx, , drop = FALSE] %>%
bind_cols(st_coordinates(.) %>% as_tibble()) %>%
st_drop_geometry() %>%
summarise(datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y))
}) %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
# Simple feature collection with 11 features and 2 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
# CRS: +proj=longlat +datum=WGS84 +no_defs
# First 10 features:
# datetime trait geometry
# 1 2021-08-06 15:08:43 91.34714 POINT (-94.58464 39.09259)
# 2 2021-08-06 15:43:22 91.65000 POINT (-94.58473 39.09274)
# 3 2021-08-06 15:08:46 91.40000 POINT (-94.5846 39.09286)
# 4 2021-08-06 15:08:47 91.58000 POINT (-94.58459 39.09291)
# 5 2021-08-06 15:43:17 91.47000 POINT (-94.58464 39.09248)
# 6 2021-08-06 15:43:18 92.19000 POINT (-94.58464 39.09255)
# 7 2021-08-06 15:43:19 92.19000 POINT (-94.58464 39.09261)
# 8 2021-08-06 15:43:20 90.57000 POINT (-94.58464 39.09266)
# 9 2021-08-06 15:43:21 90.57000 POINT (-94.58466 39.0927)
# 10 2021-08-06 15:43:22 91.65000 POINT (-94.5847 39.09273)
If the goal is to not weight any point more than any other point in the cluster averages, it would be more balanced to use weighted averages rather than trying to force each cluster to contain a set of points unique from all other clusters.
One way to think of the below methodology is to "chop up" each observation and divvy up the pieces into clusters in such a way that the weight of the pieces in each cluster sums to 1.
This will probably be too expensive for 25k observations, so one option could be to perform this on overlapping or non-overlapping segments and stitch them together.
library(sf)
library(Rfast) # for the 'eachrow' function
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
n <- nrow(df)
# sum the trait column for a sanity check after calculations
sumtrait <- sum(df$trait)
# first set tolerance
tolerance <- 20 # 20 meters
tol <- 1e-5 # tolerance for the weight matrix marginal sums
# create clusters of points grouped by circles centered at each point
i <- st_is_within_distance(df, df, tolerance)
# Initialize a matrix for the weight of each point within each cluster. The
# initial value represents an unweighted average for each cluster, so the row
# sums are not necessarily 1.
sz <- lengths(i)
w <- replace(matrix(0, n, n), unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))), rep.int(1/sz, sz))
# iteratively adjust the weights until the marginal sums all equal 1 (within
# tolerance)
marg <- rowSums(w)
while (max(abs(marg - 1)) > tol) {
w <- w/marg
marg <- colSums(w)
w <- eachrow(w, marg, "/")
marg <- rowSums(w)
}
df$trait <- colSums(w*df$trait)
print(df, n = nrow(df))
#> Simple feature collection with 11 features and 2 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
#> CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#> trait datetime geometry
#> 1 91.37719 2021-08-06 15:08:43 POINT (-94.58463 39.09253)
#> 2 91.44430 2021-08-06 15:08:44 POINT (-94.58462 39.09262)
#> 3 91.31374 2021-08-06 15:08:46 POINT (-94.5846 39.09281)
#> 4 91.46755 2021-08-06 15:08:47 POINT (-94.58459 39.09291)
#> 5 91.64053 2021-08-06 15:43:17 POINT (-94.58464 39.09248)
#> 6 91.37719 2021-08-06 15:43:18 POINT (-94.58464 39.09255)
#> 7 91.44430 2021-08-06 15:43:19 POINT (-94.58464 39.09261)
#> 8 91.41380 2021-08-06 15:43:20 POINT (-94.58464 39.09266)
#> 9 91.41380 2021-08-06 15:43:21 POINT (-94.58466 39.0927)
#> 10 91.31880 2021-08-06 15:43:22 POINT (-94.5847 39.09273)
#> 11 91.31880 2021-08-06 15:43:23 POINT (-94.58476 39.09274)
# check that the sum of the "traits" column is unchanged
sum(df$trait) - sumtrait
#> [1] 4.875536e-07
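As noted above, an n-by-n weight matrix will likely be too large for the full 25,000 observations. One option, sketched below under the assumption that the data are ordered along the collection route (so that few neighborhoods straddle chunk boundaries), is to apply the same re-weighting to consecutive chunks and stitch the results back together; reweight_chunk is a hypothetical helper wrapping the steps above:
# hypothetical helper wrapping the weighting steps above for one chunk of rows
reweight_chunk <- function(chunk, tolerance = 20, tol = 1e-5) {
  n <- nrow(chunk)
  i <- st_is_within_distance(chunk, chunk, tolerance)
  sz <- lengths(i)
  w <- replace(matrix(0, n, n),
               unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))),
               rep.int(1/sz, sz))
  marg <- rowSums(w)
  while (max(abs(marg - 1)) > tol) {
    w <- w/marg
    marg <- colSums(w)
    w <- eachrow(w, marg, "/")
    marg <- rowSums(w)
  }
  chunk$trait <- colSums(w*chunk$trait)
  chunk
}
# non-overlapping chunks of ~1000 rows, stitched back together
chunks <- split(df, ceiling(seq_len(nrow(df))/1000))
df_reweighted <- do.call(rbind, lapply(chunks, reweight_chunk))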
UPDATE: If an exclusive grouping method is really needed, this implements a greedy algorithm:
# Note: this assumes df$trait still holds the original values
# (re-run the data setup above if the weighted averages were already applied)
avg_points <- numeric(nrow(df))
clusters <- vector("list", nrow(df))
currclust <- 0L
df$unused <- TRUE
for (cl in seq_len(nrow(df))) {
if (sum(df$unused[i[[cl]]])) {
currclust <- currclust + 1L
avg_points[currclust] <- mean(df$trait[i[[cl]]][df$unused[i[[cl]]]])
clusters[[currclust]] <- i[[cl]][df$unused[i[[cl]]]]
df$unused[i[[cl]]] <- FALSE
}
}
avg_points <- avg_points[1:currclust]
clusters <- clusters[1:currclust]
avg_points
#> [1] 91.34714 91.65000 91.40000
clusters
#> [[1]]
#> [1] 1 2 5 6 7 8 9
#>
#> [[2]]
#> [1] 10 11
#>
#> [[3]]
#> [1] 3 4
Note that the issue of uneven weightings is still present--the observations in group 1 are each weighted 1/7, while the observations in groups 2 and 3 are each weighted 1/2.
Related
Context
I am in the process of developing a custom star map package called starBliss. While I thought I had successfully figured out what I needed from my earlier question, it looks like there are some edge cases where things start to break.
For more background check out the GitHub Issue here
The Question
I am presently working on a function that takes a constellation lines shapefile and re-projects it, based on longitude and latitude values, with a Lambert Azimuthal Equal Area projection to create the night sky of a given location at any given time.
A successful implementation can be seen with the present starBliss package:
#devtools::install_github("benyamindsmith/starBliss")
library(ggplot2)
library(starBliss)
p<- plot_starmap(location= "Toronto, ON, Canada",
date="2022-01-17",
style="black",
line1_text="Toronto",
line2_text ="January 17th, 2023",
line3_text="43.6532° N, 79.3832° W")
ggsave('toronto_black.png', plot = p, width = unit(10, 'in'),
height = unit(15, 'in'))
However this approach does run into problems. For example:
(image is cropped)
plot_starmap(
location= "Caracas, Venezuela",
date = as.Date("1991-03-17"),
style = "black")
The above code creates some off diagonal lines (circled in red)
A reproducible example
I put together a function that gets the constellation lines data and transforms it as the package presently does. When I use it and plot with geom_sf in ggplot2, the problem still exists.
library(tidyverse)
library(sf)
library(tidygeocoder)
library(lubridate)
custom_starmap <- function(location,
date){
# Formatting Date properly
date<- as.Date(date)
# Formatted date
dt<- lubridate::ymd(date)
# Get Latitude and Longitude for ProjString
# For Latitude
suppressMessages(
capture.output(
lat <- tibble(singlelineaddress = location) %>%
geocode(address=singlelineaddress,method = 'arcgis') %>% .[["lat"]]
)
)
# Reference date used for calculating longitude
ref_date <- paste0(year(dt),"01","01",sep="-") %>% ymd()
# Resulting longitude
lon <- (-as.numeric(difftime(ref_date,dt, units="days"))/365)*360
# The CRS
projString <- paste0("+proj=laea +x_0=0 +y_0=0 +lon_0=",round(lon,4), " +lat_0=", round(lat,4))
# Data Transformation
flip <- matrix(c(-1, 0, 0, 1), 2, 2)
hemisphere <- st_sfc(st_point(c(lon, lat)), crs = 4326) %>%
st_buffer(dist = 1e7) %>%
st_transform(crs = projString)
# Data source for constellation lines
url1 <- "https://raw.githubusercontent.com/benyamindsmith/starBliss/main/data/constellations.lines.json"
# Reading Data
invisible(
capture.output(
constellation_lines_sf <- invisible(st_read(url1, stringsAsFactors = FALSE)) %>%
st_wrap_dateline(options = c("WRAPDATELINE=YES", "DATELINEOFFSET=360")) %>%
st_transform(crs = projString) %>%
st_intersection(hemisphere) %>%
filter(!is.na(st_is_valid(.))) %>%
mutate(geometry = geometry * flip)
)
)
st_crs(constellation_lines_sf) <- projString
return(constellation_lines_sf)
}
# The data
df<-custom_starmap(location= "Caracas, Venezuela",
date = as.Date("1991-03-17"))
df
Simple feature collection with 49 features and 2 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: 8967611 ymin: -8898251 xmax: -8714977 ymax: 9004400
CRS: +proj=laea +x_0=0 +y_0=0 +lon_0=73.9726 +lat_0=10.488
First 10 features:
id rank geometry
1 And 1 MULTILINESTRING ((3542468 3...
2 Ant 3 LINESTRING (-6234955 -52010...
3 Aqr 2 MULTILINESTRING ((8967611 -...
4 Ari 1 LINESTRING (3098546 2071855...
5 Aur 1 MULTILINESTRING ((-1307725 ...
6 Cae 3 LINESTRING (557848.4 -59059...
7 Cam 2 MULTILINESTRING ((-24783.5 ...
8 Cnc 2 MULTILINESTRING ((-6264812 ...
9 CMa 1 MULTILINESTRING ((-2356827 ...
10 CMi 2 LINESTRING (-4432439 -32157..
When I plot this data the lines in question can be seen:
df %>% ggplot()+geom_sf()
(circled in red for clarity)
How do I fix this? Is there an issue with the format of the CRS that I am using? Or do I need to crop the lines?
I think it is safer to use s2 for this type of exercise:
library(tidyverse)
library(sf)
#> Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidygeocoder)
library(lubridate)
#> Loading required package: timechange
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
custom_starmap <- function(location,
date){
# Formatting Date properly
date<- as.Date(date)
# Formatted date
dt<- lubridate::ymd(date)
# Get Latitude and Longitude for ProjString
# For Latitude
suppressMessages(
capture.output(
lat <- tibble(singlelineaddress = location) %>%
geocode(address=singlelineaddress,method = 'arcgis') %>% .[["lat"]]
)
)
# Reference date used for calculating longitude
ref_date <- paste0(year(dt),"01","01",sep="-") %>% ymd()
# Resulting longitude
lon <- (-as.numeric(difftime(ref_date,dt, units="days"))/365)*360
# The CRS
projString <- paste0("+proj=laea +x_0=0 +y_0=0 +lon_0=",round(lon,4), " +lat_0=", round(lat,4))
# Data Transformation
flip <- matrix(c(-1, 0, 0, 1), 2, 2)
# Hemisphere with s2
hemisphere <- s2::s2_buffer_cells(
s2::as_s2_geography(paste0("POINT(", lon, " ", lat, ")")),
distance = 1e7,
max_cells = 5000)
# Data source for constellation lines
url1 <- "https://raw.githubusercontent.com/benyamindsmith/starBliss/main/data/constellations.lines.json"
# Reading Data
invisible(
capture.output(
constellation_lines_sf <- invisible(st_read(url1, stringsAsFactors = FALSE)) %>%
st_wrap_dateline(options = c("WRAPDATELINE=YES", "DATELINEOFFSET=360")) %>%
# Use s2 for the cut
st_as_s2() %>%
s2::s2_intersection(hemisphere) %>%
# Back to sf
st_as_sf() %>%
st_transform(crs = projString) %>%
filter(!is.na(st_is_valid(.))) %>%
mutate(geometry = geometry * flip) %>%
# Filter if empty, since the cut can produce empty geometries
filter(!st_is_empty(.))
)
)
st_crs(constellation_lines_sf) <- projString
return(constellation_lines_sf)
}
# The data
df<-custom_starmap(location= "Caracas, Venezuela",
date = as.Date("1991-03-17"))
df
#> Simple feature collection with 48 features and 0 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: -8700015 ymin: -8913303 xmax: 8922028 ymax: 8998639
#> CRS: +proj=laea +x_0=0 +y_0=0 +lon_0=73.9726 +lat_0=10.488
#> First 10 features:
#> geometry
#> 1 MULTILINESTRING ((3542468 3...
#> 2 LINESTRING (-6234955 -52010...
#> 3 MULTILINESTRING ((8922028 -...
#> 4 LINESTRING (3098546 2071855...
#> 5 MULTILINESTRING ((-1307725 ...
#> 6 LINESTRING (557848.4 -59059...
#> 7 MULTILINESTRING ((-24783.5 ...
#> 8 MULTILINESTRING ((-6264812 ...
#> 9 MULTILINESTRING ((-2356827 ...
#> 10 LINESTRING (-4432439 -32157...
ggplot(df) +
geom_sf()
Created on 2023-01-23 with reprex v2.0.2
I'm trying to use the geosphere package in R to get the distance to a polygon from a set of points that lie outside of that polygon.
The polygon is a shapefile of the Antarctic coastline, found here: https://data.bas.ac.uk/items/e6cf8946-e493-4c36-b4f5-58f7a2ee2a74/ and the points are animal tracking data.
I have tried using the syntax specified in the geosphere documentation (https://www.rdocumentation.org/packages/geosphere/versions/1.5-14/topics/dist2Line)
which is as follows:
dist2Line(p, line, distfun=distGeo)
#my attempt so far:
#libraries
library(rgdal)
library(sf)
library(rgeos)
library(tidyverse)
library(geosphere)
#my points
points <-read.csv("Analyses/example_points.csv") #this is the table included below of 4 example locations.
|ID|LON |LAT |
|--|----------|----------|
|a |-2.515478 |-69.53887 |
|b |-2.601405 |-69.79783 |
|c |-0.153548 |-69.45126 |
|d |26.06987 |-69.55020 |
#my line
line <- readOGR('Environmental_Data/COAST/add_coastline_high_res_polygon_v7_5.shp/') #this is the shapefile linked above
#convert points to spatial object
coordinates(points) <- ~LON+LAT
distance <- geosphere::dist2Line(p = points, line = line, distfun = distGeo)
However, I get an error: "Error in .spDistPoint2Line(p, line, distfun) :
Points are projected. They should be in degrees (longitude/latitude)".
The package documentation states that p can be:
"longitude/latitude of point(s). Can be a vector of two numbers, a matrix of 2 columns (first one is longitude, second is latitude) or a SpatialPoints object*" - which is what I'm providing it with. I have seen the same issue encountered on a Reddit post (unanswered) but not on here.
My desired output is as below (the distances under Dist_to_coast are made up for now!). I have ~3000 locations for which I need to find the distance to the coastline.
|ID|LON       |LAT       |Dist_to_coast (km)|
|--|----------|----------|------------------|
|a |-2.515478 |-69.53887 |40                |
|b |-2.601405 |-69.79783 |24                |
|c |-0.153548 |-69.45126 |74                |
|d |26.06987  |-69.55020 |23                |
Is there an alternative/better means of doing this?
Thank you.
You have loaded sf; is there any particular reason for not using sf::st_distance() for the task? A transform is still needed, but since there are 4 sample points versus a ~140 MB shapefile with ~17,000 polygons, it is the points that were transformed:
library(ggplot2)
library(dplyr)
library(sf)
coastline <- st_read("add_coastline_high_res_polygon_v7_6.shp/")
p <- readr::read_delim(
"ID|LON |LAT
a |-2.515478 |-69.53887
b |-2.601405 |-69.79783
c |-0.153548 |-69.45126
d |26.06987 |-69.55020" , delim = "|", trim_ws = T) %>%
st_as_sf(coords = c("LON", "LAT"), crs = "WGS84") %>%
# transform points to match crs of the shapefile
st_transform(st_crs(coastline))
# number of different surface polygons
table(coastline$surface)
#>
#> ice shelf ice tongue land rumple
#> 325 37 17233 64
# create a single multipolygon, can take a while;
# you may need to filter first to define any surface types you might want to
# include / exclude ("land" also includes islands)
system.time({
ucoastline <- st_union(coastline)
})
#> user system elapsed
#> 103.40 11.72 116.08
p$dist_ucoastline <- st_distance(p,ucoastline)
# or perhaps select land polygon with max area to
# ignore ice and all the islands:
land_max <- coastline %>%
slice_max(st_area(.))
p$land_max <- st_distance(p,land_max)
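A sketch of the surface-type filtering mentioned in the comment above, assuming only polygons tagged "land" should count as coastline:
# keep only "land" polygons (continent plus islands), then measure distances to their union
land_only <- coastline %>%
  filter(surface == "land") %>%
  st_union()
p$dist_land_only <- st_distance(p, land_only)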
ggplot() +
geom_sf(data = st_simplify(ucoastline,dTolerance = 1000), fill = "lightblue", color = NA) +
geom_sf(data = st_simplify(land_max,dTolerance = 1000), fill = "gray70") +
geom_sf(data = p, shape =4, color="red", size = 5) +
theme_bw()
Result:
# convert coordinates back to WGS84,
# geometries to coordinate columns
bind_cols(
st_transform(p, crs = "WGS84") %>% st_coordinates(),
st_drop_geometry(p)
)
#> # A tibble: 4 × 5
#> X Y ID dist_ucoastline[,1] land_max[,1]
#> <dbl> <dbl> <chr> [m] [m]
#> 1 -2.52 -69.5 a 40742. 180479.
#> 2 -2.60 -69.8 b 39750. 157043.
#> 3 -0.154 -69.5 c 6629. 186878.
#> 4 26.1 -69.6 d 45683. 121500.
Created on 2022-11-23 with reprex v2.0.2
I have a bunch of points where I want to calculate the average summarized for each grouping variable:
x = st_sfc(st_polygon(list(rbind(c(0,0),c(90,0),c(90,90),c(0,90),c(0,0)))), crs = st_crs(4326))
plot(x, axes = TRUE, graticule = TRUE)
plot(p <- st_sample(x, 7), add = TRUE)
p=st_as_sf(p)
p$test=c("A","A","B","C","C","D","D")
When using dplyr, like this, I get an NA.
p %>%
group_by(test) %>%
summarize(geometry = mean(geometry))
I just want the average into the geometry, not 1 point, nor multipoints.
I'm not sure I fully understand what you are looking for, but I'll give it a try!
So please find one possible solution with a reprex below, using the sf and dplyr libraries. I guess you were looking for the aggregate() function instead of group_by().
Reprex
Code
library(sf)
library(dplyr)
R1 <- p %>% aggregate(.,
by = list(.$test),
function(x) x = x[1]) %>%
st_centroid() %>%
select(-Group.1)
#> Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
#> geometries of x
Output 1 (sf object)
R1
#> Simple feature collection with 4 features and 1 field
#> Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 2.7875 ymin: 12.91954 xmax: 59.60413 ymax: 51.81421
#> Geodetic CRS: WGS 84
#> test geometry
#> 1 A POINT (27.17167 12.91954)
#> 2 B POINT (2.7875 22.54184)
#> 3 C POINT (59.60413 46.90029)
#> 4 D POINT (56.34763 51.81421)
Complementary code and Output 2 (i.e. if you just need a dataframe)
R2 <- R1 %>%
st_coordinates() %>%
cbind(st_drop_geometry(R1),.)
R2
#> test X Y
#> 1 A 27.17167 12.91954
#> 2 B 2.78750 22.54184
#> 3 C 59.60413 46.90029
#> 4 D 56.34763 51.81421
Visualization
plot(x)
plot(p, add = TRUE)
plot(R1, pch = 15, add = TRUE)
Points are your data and the small squares are the centroids for each group.
(FYI, I set the seed to 427 for reproducibility purposes.)
NB: The above uses spherical geometry. If you want planar computations, just add sf_use_s2(FALSE) at the beginning of the script. To show the difference, here is the result using sf_use_s2(FALSE): in this case you can see that, for each group, the centroid is located precisely on the line connecting the two points. It is up to you to choose according to your needs.
Created on 2022-01-03 by the reprex package (v2.0.1)
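For comparison with the group_by() attempt in the question: mean(geometry) returns NA (with a warning) because the geometry column is a list-column rather than a numeric vector. A dplyr-only sketch, assuming the centroid of each group's combined points is the desired average:
p %>%
  group_by(test) %>%
  summarise() %>%  # geometries are unioned within each group
  st_centroid()    # one average point per group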
I have a data.table with 957 geocodes. I want to match it with another dataset with 317 geocodes. The matching condition is geospatial proximity. I want to match each observation from the first dataset to an observation from the second one such that the distance between both observations is 5000 meters or less.
My data looks like this:
> muni[1:3]
mun Lat_Decimal Lon_Decimal
1: 1001 21.76672 -102.2818
2: 1002 22.16597 -102.0657
3: 1003 21.86138 -102.7248
> stations[1:3]
station_number station_lat station_long
1: 10003 25.100 -106.567
2: 10018 24.944 -106.259
3: 10031 24.523 -105.952
I am using the distm function from library(geosphere) to calculate the distance.
I figured the way to attack this problem is a while loop. The idea is to take the first observation from muni and measure the distance to the first observation in stations. If the distance is 5000 meters or less, then assign the station_number of that observation in stations to the first observation in muni. If the distance is greater than 5000, then try the next observation in stations until the distance is 5000 meters or less.
Essentially, it's a loop that finds the first observation in stations that's 5000 meters or closer to an observation in muni.
This is a preliminary attempt at it:
for (i in 1:957) {
j = 1
while (distm(muni[i, .(Lon_Decimal, Lat_Decimal)],
stations[j, .(station_long, station_lat)]) > 5000 & j <= 317) {
muni[i, station_number := as.integer(stations[j, station_number])]
muni[i, distance := distm(muni[i, .(Lon_Decimal, Lat_Decimal)],
stations[j, .(station_long, station_lat)])]
j = j + 1
}
}
I can tell this is not working because none of the rows in muni appear to have been overwritten after running this loop for (i in 1:3). I suppose there is an error in my loop that is ignoring the station_number := and distance := parts.
I would expect this loop to overwrite muni such that all the entire column had a station_number.
I've read your few sample points as data.frames and converted them to sf below for the answer. If you're attached to geosphere, forgive the pun, everything should still apply the same, given that geosphere::distm also returns a matrix of distances.
First we get your data into sf format:
library(sf)
stations_raw <- "station_number station_lat station_long
1: 10003 25.100 -106.567
2: 10018 24.944 -106.259
3: 10031 24.523 -105.952"
mun_raw <- "mun Lat_Decimal Lon_Decimal
1: 1001 21.76672 -102.2818
2: 1002 22.16597 -102.0657
3: 1003 21.86138 -102.7248"
mun_df <- read.table(text = mun_raw)
stations_df <- read.table(text = stations_raw)
mun_sf <- st_as_sf(mun_df, coords = c("Lon_Decimal", "Lat_Decimal"), crs = 4326)
stations_sf <- st_as_sf(stations_df,
coords = c("station_long", "station_lat"), crs = 4326)
Then, for each muni point, we find the closest station:
closest <- list()
for(i in seq_len(nrow(mun_sf))){
closest[[i]] <- stations_sf[which.min(
st_distance(stations_sf, mun_sf[i,])),]
}
Finally, we extract the identifiers and attach them to the original df, removing the mun_id as you request:
mun_sf$closest_station <- purrr::map_chr(closest, "station_number")
mun_sf <- mun_sf[, c("closest_station", "geometry")]
mun_sf
#> Simple feature collection with 3 features and 1 field
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: -102.7248 ymin: 21.76672 xmax: -102.0657 ymax: 22.16597
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> closest_station geometry
#> 1: 10031 POINT (-102.2818 21.76672)
#> 2: 10031 POINT (-102.0657 22.16597)
#> 3: 10031 POINT (-102.7248 21.86138)
The plot below helps visually check that, in this toy example, we've got the right answer.
ggplot() +
geom_sf(data = mun_sf, colour = "red") +
geom_sf_text(data = mun_sf, aes(label = mun), nudge_x = 0.25) +
geom_sf(data = stations_sf, colour = "blue") +
geom_sf_text(data = stations_sf, aes(label = station_number), nudge_x = -0.25)
#> Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
#> not give correct results for longitude/latitude data
#> Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
#> not give correct results for longitude/latitude data
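If the 5000 m condition from the question must be enforced (stations further away should give NA), here is a sketch using st_nearest_feature() and element-wise distances; the new column names are only illustrative:
nearest_idx <- st_nearest_feature(mun_sf, stations_sf)
dist_m <- st_distance(mun_sf, stations_sf[nearest_idx, ], by_element = TRUE)
mun_sf$closest_station_5km <- ifelse(as.numeric(dist_m) <= 5000,
                                     stations_sf$station_number[nearest_idx],
                                     NA)
mun_sf$distance_m <- as.numeric(dist_m)
# in this toy example all stations are far more than 5 km away, so the result is NA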
I'm trying to place a grid over San Jose like this:
Grid of San Jose
You can make the grid visually using the following code:
ca_cities = tigris::places(state = "CA") #using tigris package to get shape file of all CA cities
sj = ca_cities[ca_cities$NAME == "San Jose",] #specifying to San Jose
UTM_ZONE = "10" #the UTM zone for San Jose, will be used to convert the proj4string of sj into UTM
main_sj = sj@polygons[[1]]@Polygons[[5]] #the portion of the shape file I focus on. This is the boundary of San Jose
#converting the main_sj polygon into a spatialpolygondataframe using the sp package
tst_ps = sp::Polygons(list(main_sj), 1)
tst_sps = sp::SpatialPolygons(list(tst_ps))
proj4string(tst_sps) = proj4string(sj)
df = data.frame(f = 99.9)
tst_spdf = sp::SpatialPolygonsDataFrame(tst_sps, data = df)
#transforming the proj4string and declaring the finished map as "map"
map = sp::spTransform(tst_sps, CRS(paste0("+proj=utm +zone=",UTM_ZONE," ellps=WGS84")))
#designates the number of horizontal and vertical lines of the grid
NUM_LINES_VERT = 25
NUM_LINES_HORZ = 25
#getting bounding box of map
bbox = map@bbox
#Marking the x and y coordinates for each of the grid lines.
x_spots = seq(bbox[1,1], bbox[1,2], length.out = NUM_LINES_HORZ)
y_spots = seq(bbox[2,1], bbox[2,2], length.out = NUM_LINES_VERT)
#creating the coordinates for the lines. top and bottom connect to each other. left and right connect to each other
top_vert_line_coords = expand.grid(x = x_spots, y = y_spots[1])
bottom_vert_line_coords = expand.grid(x = x_spots, y = y_spots[length(y_spots)])
left_horz_line_coords = expand.grid(x = x_spots[1], y = y_spots)
right_horz_line_coords = expand.grid(x = x_spots[length(x_spots)], y = y_spots)
#creating vertical lines and adding them all to a list
vert_line_list = list()
for(n in 1 : nrow(top_vert_line_coords)){
vert_line_list[[n]] = sp::Line(rbind(top_vert_line_coords[n,], bottom_vert_line_coords[n,]))
}
vert_lines = sp::Lines(vert_line_list, ID = "vert") #creating Lines object of the vertical lines
#creating horizontal lines and adding them all to a list
horz_line_list = list()
for(n in 1 : nrow(top_vert_line_coords)){
horz_line_list[[n]] = sp::Line(rbind(left_horz_line_coords[n,], right_horz_line_coords[n,]))
}
horz_lines = sp::Lines(horz_line_list, ID = "horz") #creating Lines object of the horizontal lines
all_lines = sp::Lines(c(horz_line_list, vert_line_list), ID = 1) #combining horizontal and vertical lines into a single grid format
grid_lines = sp::SpatialLines(list(all_lines)) #converting the lines object into a Spatial Lines object
proj4string(grid_lines) = proj4string(map) #ensuring the projections are the same between the map and the grid lines.
trimmed_grid = intersect(grid_lines, map) #grid that shapes to the san jose map
plot(map) #plotting the map of San Jose
lines(trimmed_grid) #plotting the grid
However, I am struggling to turn each grid 'square' (some of the grid pieces are not squares, since they fit to the shape of the San Jose map) into a bin into which I could put data. Put another way, if each grid 'square' were numbered 1:n, then I could make a dataframe like this:
grid_id num_assaults num_thefts
1 1 100 89
2 2 55 456
3 3 12 1321
4 4 48 498
5 5 66 6
and fill each grid 'square' with data from the point location of each crime occurrence, hopefully using the over() function from the sp package.
I have tried solving this problem for weeks, and I can't figure it out. I have looked for an easy solution, but I can't seem to find it. Any help would be appreciated.
Additionally, here's an sf and tidyverse-based solution:
With sf, you can make a grid of squares with the st_make_grid() function. Here I'll make a 2km grid over San Jose's bounding box, then intersect it with the boundary of San Jose. Note that I'm projecting to UTM zone 10N so I can specify the grid size in meters.
library(tigris)
library(tidyverse)
library(sf)
options(tigris_class = "sf", tigris_use_cache = TRUE)
set.seed(1234)
sj <- places("CA", cb = TRUE) %>%
filter(NAME == "San Jose") %>%
st_transform(26910)
g <- sj %>%
st_make_grid(cellsize = 2000) %>%
st_intersection(sj) %>%
st_cast("MULTIPOLYGON") %>%
st_sf() %>%
mutate(id = row_number())
Next, we can generate some random crime data with st_sample() and plot it to see what we are working with.
thefts <- st_sample(sj, size = 500) %>%
st_sf()
assaults <- st_sample(sj, size = 200) %>%
st_sf()
plot(g$geometry)
plot(thefts, add = TRUE, col = "red")
Crime data can then be joined to the grid spatially with st_join(). We can plot to check our results.
theft_grid <- g %>%
st_join(thefts) %>%
group_by(id) %>%
summarize(num_thefts = n())
plot(theft_grid["num_thefts"])
We can then do the same with the assaults data, then join the two datasets together to get the desired result. If you had a lot of crime datasets, these could be modified to work within some variation of purrr::map().
assault_grid <- g %>%
st_join(assaults) %>%
group_by(id) %>%
summarize(num_assaults = n())
st_geometry(assault_grid) <- NULL
crime_data <- left_join(theft_grid, assault_grid, by = "id")
crime_data
Simple feature collection with 190 features and 3 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 584412 ymin: 4109499 xmax: 625213.2 ymax: 4147443
epsg (SRID): 26910
proj4string: +proj=utm +zone=10 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# A tibble: 190 x 4
id num_thefts num_assaults geometry
<int> <int> <int> <GEOMETRY [m]>
1 1 2 1 POLYGON ((607150.3 4111499, 608412 4111499, 608412 4109738,…
2 2 4 1 POLYGON ((608412 4109738, 608412 4111499, 609237.8 4111499,…
3 3 3 1 POLYGON ((608412 4113454, 608412 4111499, 607150.3 4111499,…
4 4 2 2 POLYGON ((609237.8 4111499, 608412 4111499, 608412 4113454,…
5 5 1 1 MULTIPOLYGON (((610412 4112522, 610412 4112804, 610597 4112…
6 6 1 1 POLYGON ((616205.4 4113499, 616412 4113499, 616412 4113309,…
7 7 1 1 MULTIPOLYGON (((617467.1 4113499, 618107.9 4113499, 617697.…
8 8 2 1 POLYGON ((605206.8 4115499, 606412 4115499, 606412 4114617,…
9 9 5 1 POLYGON ((606412 4114617, 606412 4115499, 608078.2 4115499,…
10 10 1 1 POLYGON ((609242.7 4115499, 610412 4115499, 610412 4113499,…
# ... with 180 more rows
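As a sketch of the purrr::map() variation mentioned above, assuming the crime layers are collected in a named list (the list and column names below are only illustrative):
crime_layers <- list(num_thefts = thefts, num_assaults = assaults)
# count each layer into the grid, one tibble of counts per layer
crime_counts <- purrr::imap(crime_layers, function(layer, nm) {
  g %>%
    st_join(layer) %>%
    st_drop_geometry() %>%
    group_by(id) %>%
    summarize(!!nm := n())
})
# join all count tables back onto the grid by cell id
crime_data2 <- g %>%
  left_join(purrr::reduce(crime_counts, left_join, by = "id"), by = "id")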
With a Spatial* object, as in your data:
library(tigris)
ca_cities = tigris::places(state = "CA") #using tigris package to get shape file of all CA cities
sj = ca_cities[ca_cities$NAME == "San Jose",] #specifying to San Jose
sjutm = sp::spTransform(sj, CRS("+proj=utm +zone=10 +datum=WGS84"))
You can make a grid of polygons like this
library(raster)
r <- raster(sjutm, ncol=25, nrow=25)
rp <- as(r, 'SpatialPolygons')
Show it
plot(sjutm, col='red')
lines(rp, col='blue')
To count the number of cases per grid cell (using some random points here) you do not want to use the polygons but rather the RasterLayer
set.seed(0)
x <- runif(500, xmin(r), xmax(r))
y <- runif(500, ymin(r), ymax(r))
xy1 <- cbind(x, y)
x <- runif(500, xmin(r), xmax(r))
y <- runif(500, ymin(r), ymax(r))
xy2 <- cbind(x, y)
d1 <- rasterize(xy1, r, fun="count", background=0)
d2 <- rasterize(xy2, r, fun="count", background=0)
plot(d1)
plot(sjutm, add=TRUE)
Followed by
s <- stack(d1, d2)
names(s) = c("assault", "theft")
s <- mask(s, sjutm)
plot(s, addfun=function()lines(sjutm))
To get the table you are after
p <- rasterToPoints(s)
cell <- cellFromXY(s, p[,1:2])
res <- data.frame(grid_id=cell, p[,3:4])
head(res)
# grid_id assault theft
#1 1 1 1
#2 2 0 1
#3 3 0 3
#4 5 1 1
#5 6 1 0
#6 26 0 0
You can also create a SpatialPolygonsDataFrame from the results
pp <- as(s, 'SpatialPolygonsDataFrame')
pp
#class : SpatialPolygonsDataFrame
#features : 190
#extent : 584411.5, 623584.9, 4109499, 4147443 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=utm +zone=10 +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
#variables : 2
#names : assault, theft
#min values : 0, 0
#max values : 4, 5
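Since the question mentions sp::over(), here is a sketch of counting points per grid polygon directly with over(), reusing the rp polygons and the xy1/xy2 point matrices from above (counted over the full grid, not masked to the city boundary):
pts1 <- sp::SpatialPoints(xy1, proj4string = crs(sjutm))
pts2 <- sp::SpatialPoints(xy2, proj4string = crs(sjutm))
# for each grid polygon, over(..., returnList = TRUE) gives the indices of the
# points that fall inside it; lengths() turns that into counts
res_over <- data.frame(
  grid_id = seq_along(rp@polygons),
  assault = lengths(sp::over(rp, pts1, returnList = TRUE)),
  theft   = lengths(sp::over(rp, pts2, returnList = TRUE))
)
head(res_over)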
If your goal is only the visual, and not necessarily all the grid-aggregation code and data, you can generate an interactive map and grid with library(mapdeck) (noting you'll need a Mapbox access token).
The first step to generate the data is borrowed from @kwalkertcu's answer:
library(tigris)
library(sf)
options(tigris_class = "sf", tigris_use_cache = TRUE)
set.seed(1234)
sj <- places("CA", cb = TRUE) %>%
filter(NAME == "San Jose") %>%
st_transform(26910)
thefts <- st_sample(sj, size = 500) %>%
st_sf() %>%
st_transform(crs = 4326)
## some random weight data
thefts$weight <- sample(1:100, size = nrow(thefts), replace = T)
Then, given an sf object with a weight column, you can plot it using add_screengrid()
library(mapdeck)
set_token("MAPBOX_TOKEN")
mapdeck(
style = mapdeck_style("dark")
, location = c(-121.8, 37.3)
, zoom = 6
) %>%
add_screengrid(
data = thefts
, cell_size = 15
, weight = "weight"
)
Notes:
I'm using the GitHub version of mapdeck, where the API has changed slightly, but the CRAN version should yield the same result.