So I'm trying to plot a bunch of coordinates on the earth and track how many coordinates fall in each country. I have plotted the map and coordinates fine, but when I try to use the intersection to count how many coordinates fall within each country (polygon) it results in an error. I've tried using the st_make_valid function to fix the earth shape file, but it messes up the geometry. I'm new to using R, so any help would be greatly appreciated.
I have used the following code to plot the earth shape file and the coordinates on top:
library(tidyverse)
library(sf)
library(rmapshaper)
library(rnaturalearth)
library(rnaturalearthdata)
library(sp)
library(raster)
###############
# Load Data
###############
# Read in data from .csv file
MeteoriteData <- read.csv("C:/Users/ChaseDickson_/Desktop/College/AERO 689/Semester Project/Meteorite Landings.csv")
# Convert these points to an SF object, specifying the X and Y
# column names, and supplying the CRS as 4326 (which is WGS84)
MeteoriteData.sf <- st_as_sf(MeteoriteData, coords=c('long', 'lat'), crs=4326)
world <- (ne_countries(scale = "medium", returnclass = "sf"))
MeteoriteMap <- ggplot(data = world) +
geom_sf() +
geom_sf(data = MeteoriteData.sf, size = 0.5, shape = 23, fill = "darkred") +
theme_bw()
MeteoriteMap
Which gives the following plot
However, to get the intersection I used this:
intersection <- st_intersection(x = world, y = MeteoriteData.sf)
But it gave the error
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
Loop 96 is not valid: Edge 743 crosses edge 998
To fix this I changed the world sf by adding st_make_valid like so:
world <- st_make_valid(ne_countries(scale = "small", returnclass = "sf"))
Now this allows the intersection function to work as such:
intersection <- st_intersection(x = world, y = MeteoriteData.sf)
int_result <- intersection %>%
group_by(sovereignt) %>%
count()
And the output is shown below.
However, this messes up the countries (polygons) in the plot and will give inaccurate data, as the new earth shape file shown below demonstrates:
Any help figuring out how to maintain the first plot, but still get the intersection function and count to work after adding st_make_valid would be greatly appreciated!
The {rnaturalearth} package has had a long and productive history, but - kind of like the similar {maps} package - it belongs to a different, less demanding era. You should consider doing a Marie Kondo on it: thank it for its service, and let it go.
So instead of trying to repair its failings, look for a different instance of the world dataset; this is a very common, and thus well standardized, use case.
Consider this piece of code, and note that there is not just a single piece of invalid geometry, but six (out of 241). Correcting them one by one would be a fruitless task.
library(sf)
rnaturalearth::ne_countries(scale = "medium", returnclass = "sf") |>
st_is_valid() |>
table()
# FALSE TRUE
# 6 235
My preferred source of the world country data is the {giscoR} package, which interfaces GISCO spatial dataset, ultimately maintained by Eurostat.
It is very handy, known to be valid and actively maintained.
giscoR::gisco_get_countries(resolution = "20") |>
st_is_valid() |>
table()
# TRUE
# 257
The rest of your code - the intersection and plotting part - should work just fine once you get rid of invalid geometries.
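For completeness, here is a minimal sketch of the counting step with the GISCO data. It assumes the MeteoriteData.sf object from the question is available; NAME_ENGL is the country-name column in the GISCO dataset:
library(sf)
library(dplyr)
world <- giscoR::gisco_get_countries(resolution = "20")
# each intersection row is one point falling inside one country,
# so counting rows per country gives meteorite landings per country
meteorites_per_country <- st_intersection(world, MeteoriteData.sf) |>
st_drop_geometry() |>
count(NAME_ENGL, sort = TRUE)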
Just a note regarding st_intersection() (and shamelessly building on top of Jindra's answer): it happens to be an extremely expensive method for retrieving a hit count, especially if the resulting geometries are just a by-product that will be discarded (and it is expensive even for that purpose; check st_join()).
For counting you can save a lot (roughly ~250ms vs ~16s(!) for 100 random points on my machine) by opting for st_intersects():
library(sf)
library(ggplot2)
countries <- giscoR::gisco_get_countries(resolution = "20")
set.seed(1)
random_points <- data.frame(x = runif(100,-180,180), y = runif(100,-90,90)) |>
st_as_sf(coords = c("x","y"), crs = "WGS84")
#> Measure st_intersection():
system.time({
countries_isect <- st_intersection(countries, random_points)
})
#> user system elapsed
#> 15.97 0.47 16.50
#> Measure st_intersects():
system.time({
countries$hits <- lengths(st_intersects(countries, random_points))
})
#> user system elapsed
#> 0.22 0.00 0.22
ggplot(countries) +
geom_sf(data = countries) +
geom_sf(data = random_points) +
theme_void()
Results, i.e. non-zero lengths(st_intersects(countries, random_points)):
countries[countries$hits > 0,c("NAME_ENGL","hits")] |> st_drop_geometry()
#> NAME_ENGL hits
#> 4 Antarctica 7
#> 13 Australia 1
#> 30 Brazil 4
#> 31 China 3
#> 48 Greenland 1
#> 55 Canada 3
#> 116 Kazakhstan 1
#> 117 Laos 1
#> 125 Cambodia 1
#> 141 Mauritania 1
#> 155 Oman 1
#> 162 Paraguay 1
#> 184 Mali 1
#> 185 Russian Federation 5
#> 221 United States 3
#> 223 Venezuela 1
#> 249 Thailand 1
Created on 2022-11-23 with reprex v2.0.2
The goal is to average together points within 10 meters of each other without repeating any points in the averaging, to reduce the point dataframe to the averaged points, and ideally to obtain a smooth flow of points along the routes where the points were collected. Here is an 11-point subset example dataframe from a much larger file (25,000 observations):
library(sf)
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plotable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
Here is what I've tried:
Many iterations of st_is_within_distance(trav, trav, tolerance) including:
an aggregate method shown here. These don't work because the same points get averaged multiple times.
Got close with filter() and across() by trying to dynamically update a list in lapply(), but it didn't work out in the end.
This answer from @jeffreyevans is helpful, but doesn't really solve the problem and is a bit outdated.
The spThin package doesn't work because it's made for more specific variables.
I thought to cluster using this post, but the clusters throw random points and don't actually reduce the dataframe efficiently.
Here is as close as I've gotten. Again, the issue with this solution is it repeats points in collecting averages, which gives more weight to certain points than others.
library(dplyr) # needed for %>%, summarise() and bind_rows()
# first set tolerance
tolerance <- 20 # 20 meters
# get distance between points
i <- st_is_within_distance(df, df, tolerance)
# filter for indices with more than 1 (self) neighbor
i <- i[which(lengths(i) > 1)]
# filter for unique indices (point 1, 2 / point 2, 1)
i <- i[!duplicated(i)]
# points in `sf` object that have no neighbors within tolerance
no_neighbors <- df[!(1:nrow(df) %in% unlist(i)), ]
# iterate over indices of neighboring points
avg_points <- lapply(i, function(b){
df <- df[unlist(b), ]
coords <- st_coordinates(df)
df <- df %>%
st_drop_geometry() %>%
cbind(., coords)
df_sum <- df %>%
summarise(
datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y),
.groups = 'drop') %>%
ungroup()
return(df_sum)
}) %>%
bind_rows() %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
Another answer, using aggregate() (via its sf method) and a hexagonal grid to find points that are within a particular distance from each other. A square grid could be used as well. Results will vary somewhat depending on where exactly the grid falls in relation to the points, but no point should be used more than once in determining the mean.
Outline of the steps:
load data, transform to crs 5070 for measurements in meters
get a bounding box of the data
make a grid of hexagons of the bounding box of ~10m diameter each
aggregate points falling in the same hexagon using mean
join to original data
library(sf)
library(tidyverse)
set.seed(22) # might be needed to get same hex grid?
#### your sample data
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plotable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs) %>%
st_transform(5070) ### transform to 5070 for a projection in meters
#### end sample data
# Get a bounding box as an sf object to make a grid
bbox <- st_bbox(df) %>% st_as_sfc()
# Make a grid as hexagons with approximately the right size
# area ~86 m^2; side ~5.75 m; long diag ~11.5 m
hex_grid <- st_make_grid(bbox, cellsize = 10, square = F) %>% st_as_sf()
# Aggregate mean of the hexagonal grid
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait))
# Assign the mean of the hexagon to points that fall
# within each hexagon
df_agg <- st_join(df, hex_agg)
head(df_agg) # trait.x from df, trait.y from the mean by hexagon
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 121281.6 ymin: 1786179 xmax: 121285.3 ymax: 1786227
#> Projected CRS: NAD83 / Conus Albers
#> trait.x datetime.x trait.y datetime.y
#> 1 91.22 2021-08-06 15:08:43 91.70500 2021-08-06 15:26:00
#> 2 91.22 2021-08-06 15:08:44 91.32667 2021-08-06 15:31:47
#> 3 91.22 2021-08-06 15:08:46 91.22000 2021-08-06 15:08:46
#> 4 91.58 2021-08-06 15:08:47 91.58000 2021-08-06 15:08:47
#> 5 91.47 2021-08-06 15:43:17 91.47000 2021-08-06 15:43:17
#> 6 92.19 2021-08-06 15:43:18 91.70500 2021-08-06 15:26:00
#> geometry
#> 1 POINT (121282.5 1786184)
#> 2 POINT (121283.2 1786194)
#> 3 POINT (121284.6 1786216)
#> 4 POINT (121285.3 1786227)
#> 5 POINT (121281.7 1786179)
#> 6 POINT (121281.6 1786186)
sum(df_agg$trait.x) - sum(df_agg$trait.y) # original trait - aggregate trait should be 0, or near 0
#> [1] 0
ggplot(df_agg) +
geom_sf(aes(size = trait.x), alpha = .2, color = 'blue') + # Original triat
geom_sf(aes(size = trait.y), alpha = .2, color = 'red') + # New aggregated trait
theme_void()
Sized by trait. Blue points are original, red is the new spatial mean.
## Plot of
# original points & hex grid used:
ggplot() +
geom_sf(data = df, color = 'red') +
geom_sf(data = hex_grid, fill = NA) +
theme_void()
Plot showing the grouping of the points for the mean. It looks like there were groups of 1, 2, and 3 points per hexagon for the mean.
Created on 2022-03-23 by the reprex package (v2.0.1)
Edit
Updated to have only one point per hexagon, losing some of the original points
## Edit for one point per hexagon:
hex_agg <- aggregate(df ,
hex_grid,
mean,
join = st_contains) %>% filter(!is.na(trait)) %>%
rownames_to_column('hex_num') # add hexagon number to group_by
# re-create the joined points so each point carries its hexagon number
df_agg <- st_join(df, hex_agg)
## Guide to join on, has only hexagon number & centroid of contained points
hex_guide <- df_agg %>%
group_by(hex_num) %>%
summarise() %>%
st_centroid()
# The full sf object with only one point per hexagon
# this join isn't the most efficient, but slice(1) removes
# the duplicate data. You could clean df_agg before the join
# to resolve this
final_join <- df_agg %>%
st_drop_geometry() %>%
left_join(hex_guide, by = 'hex_num') %>%
group_by(hex_num) %>%
slice(1) %>%
ungroup() %>%
st_as_sf()
ggplot() +
geom_sf(data = final_join, color = 'red', size = 3) +
geom_sf(data = df, color = 'black', alpha = .5) +
geom_sf(data = hex_grid, color = 'blue', fill = NA)
The plot shows the hexagons, original data points in grey, and new red points at the centroid of grouped original points. Only 1 red point per hexagon.
I'm not sure, but perhaps this is what you are looking for?
You can experiment with the different settings/methods of smoothr::smooth() to get the desired results.
library(tidyverse)
library(igraph)
library(smoothr)
library(mapview) # for viewing purposes only
# get a matrix of points <10 meter apart
m <- st_is_within_distance(df, dist = 10, sparse = FALSE)
# create an igraph from the matrix
g <- graph.adjacency(m, mode = "undirected", diag = FALSE)
plot(g)
Points that are within 10 metres of each other.
# pass cluster-number to df object
df$id <- as.vector(components(g)$membership)
# create polylines (only if more than 1 point!)
df.lines <- df %>%
group_by(id) %>%
dplyr::add_tally() %>%
dplyr::filter(n > 1) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("LINESTRING") %>%
# create smooth lines
smoothr::smooth(method = "ksmooth")
#view points and lines
mapview::mapview(list(df, df.lines))
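You can also compare several smoothr::smooth() methods side by side. A quick sketch, reusing the objects above but rebuilding the raw (unsmoothed) lines first so each method starts from the same input; "chaikin" and "spline" are other methods offered by smoothr:
# rebuild the combined linestrings without the final smoothing step
df.raw <- df %>%
group_by(id) %>%
dplyr::add_tally() %>%
dplyr::filter(n > 1) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("LINESTRING")
# apply different smoothing methods to the same lines
lines_ksmooth <- smoothr::smooth(df.raw, method = "ksmooth")
lines_chaikin <- smoothr::smooth(df.raw, method = "chaikin")
lines_spline <- smoothr::smooth(df.raw, method = "spline")
# view them together
mapview::mapview(list(df.raw, lines_ksmooth, lines_chaikin, lines_spline))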
If I understand your problem correctly, it all boils down to selecting the "right" neighbors, i.e. those within a certain neighborhood which were not used yet. If there is no such neighbor, simply use the point itself (even if it was already used in the averaging for another point).
Here's a solution using purrr::accumulate to first produce the correct indices, and then simply using these indices to do the averaging:
library(purrr)
library(dplyr)
idx <- accumulate(i[-1L], function(x, y) {
x$points <- setdiff(y, x$used)
x$used <- union(x$used, y)
x
}, .init = list(used = i[[1L]], points = i[[1L]]))
idx[1:4]
# [[1]]
# [[1]]$used
# [1] 1 2 5 6 7 8 9
#
# [[1]]$points
# [1] 1 2 5 6 7 8 9
#
#
# [[2]]
# [[2]]$used
# [1] 1 2 5 6 7 8 9 10 11
#
# [[2]]$points
# [1] 10 11
#
#
# [[3]]
# [[3]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[3]]$points
# [1] 3 4
#
#
# [[4]]
# [[4]]$used
# [1] 1 2 5 6 7 8 9 10 11 3 4
#
# [[4]]$points
# integer(0)
The idea is that we maintain a list of used indices, that is, the ones which were already used in any of the neighborhoods, and the remaining ones (points). For instance, for the first point we use the points at indices 1, 2, 5, 6, 7, 8, 9, which leaves only indices 10, 11 for the second point. If there is no point left, we return integer(0).
Now that we have set up the indices list, the rest is easy: loop through the list, select the indicated points (using the point itself in case there is no point left) and do the averaging:
idx %>%
imap_dfr(function(x, y) {
if (!length(x$points)) {
idx <- y
} else {
idx <- x$points
}
df[idx, , drop = FALSE] %>%
bind_cols(st_coordinates(.) %>% as_tibble()) %>%
st_drop_geometry() %>%
summarise(datetime = first(datetime),
trait = mean(trait),
X = mean(X),
Y = mean(Y))
}) %>%
st_as_sf(coords = c('X', 'Y'),
crs = "+proj=longlat +datum=WGS84 +no_defs ")
# Simple feature collection with 11 features and 2 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
# CRS: +proj=longlat +datum=WGS84 +no_defs
# First 10 features:
# datetime trait geometry
# 1 2021-08-06 15:08:43 91.34714 POINT (-94.58464 39.09259)
# 2 2021-08-06 15:43:22 91.65000 POINT (-94.58473 39.09274)
# 3 2021-08-06 15:08:46 91.40000 POINT (-94.5846 39.09286)
# 4 2021-08-06 15:08:47 91.58000 POINT (-94.58459 39.09291)
# 5 2021-08-06 15:43:17 91.47000 POINT (-94.58464 39.09248)
# 6 2021-08-06 15:43:18 92.19000 POINT (-94.58464 39.09255)
# 7 2021-08-06 15:43:19 92.19000 POINT (-94.58464 39.09261)
# 8 2021-08-06 15:43:20 90.57000 POINT (-94.58464 39.09266)
# 9 2021-08-06 15:43:21 90.57000 POINT (-94.58466 39.0927)
# 10 2021-08-06 15:43:22 91.65000 POINT (-94.5847 39.09273)
If the goal is to not weight any point more than any other point in the cluster averages, it would be more balanced to use weighted averages rather than trying to force each cluster to contain a set of points unique from all other clusters.
One way to think of the below methodology is to "chop up" each observation and divvy up the pieces into clusters in such a way that the weight of the pieces in each cluster sums to 1.
This will probably be too expensive for 25k observations, so one option could be to perform this on overlapping or non-overlapping segments and stitch them together (a rough sketch of that follows the output below).
library(sf)
library(Rfast) # for the 'eachrow' function
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)),
datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plotable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,
coords = c("lon", "lat"),
crs = projcrs)
n <- nrow(df)
# sum the trait column for a sanity check after calculations
sumtrait <- sum(df$trait)
# first set tolerance
tolerance <- 20 # 20 meters
tol <- 1e-5 # tolerance for the weight matrix marginal sums
# create clusters of points grouped by circles centered at each point
i <- st_is_within_distance(df, df, tolerance)
# Initialize a matrix for the weight of each point within each cluster. The
# initial value represents an unweighted average for each cluster, so the row
# sums are not necessarily 1.
sz <- lengths(i)
w <- replace(matrix(0, n, n), unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))), rep.int(1/sz, sz))
# iteratively adjust the weights until the marginal sums all equal 1 (within
# tolerance)
marg <- rowSums(w)
while (max(abs(marg - 1)) > tol) {
w <- w/marg
marg <- colSums(w)
w <- eachrow(w, marg, "/")
marg <- rowSums(w)
}
df$trait <- colSums(w*df$trait)
print(df, n = nrow(df))
#> Simple feature collection with 11 features and 2 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
#> CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#> trait datetime geometry
#> 1 91.37719 2021-08-06 15:08:43 POINT (-94.58463 39.09253)
#> 2 91.44430 2021-08-06 15:08:44 POINT (-94.58462 39.09262)
#> 3 91.31374 2021-08-06 15:08:46 POINT (-94.5846 39.09281)
#> 4 91.46755 2021-08-06 15:08:47 POINT (-94.58459 39.09291)
#> 5 91.64053 2021-08-06 15:43:17 POINT (-94.58464 39.09248)
#> 6 91.37719 2021-08-06 15:43:18 POINT (-94.58464 39.09255)
#> 7 91.44430 2021-08-06 15:43:19 POINT (-94.58464 39.09261)
#> 8 91.41380 2021-08-06 15:43:20 POINT (-94.58464 39.09266)
#> 9 91.41380 2021-08-06 15:43:21 POINT (-94.58466 39.0927)
#> 10 91.31880 2021-08-06 15:43:22 POINT (-94.5847 39.09273)
#> 11 91.31880 2021-08-06 15:43:23 POINT (-94.58476 39.09274)
# check that the sum of the "traits" column is unchanged
sum(df$trait) - sumtrait
#> [1] 4.875536e-07
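Regarding the segmenting idea mentioned above, a rough sketch could look like the following. It simply wraps the weighting steps shown above in a helper function and applies it to non-overlapping chunks of the original (un-averaged) points; the chunk size of 1000 rows is an arbitrary choice, and clusters spanning a segment boundary are not merged in this simple version:
library(sf)
library(Rfast) # for the 'eachrow' function
weighted_cluster_avg <- function(chunk, tolerance = 20, tol = 1e-5) {
i <- st_is_within_distance(chunk, chunk, tolerance)
n <- nrow(chunk)
sz <- lengths(i)
# initial unweighted-average weights, as above
w <- replace(matrix(0, n, n),
unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))),
rep.int(1/sz, sz))
# iteratively rescale until the marginal sums are all 1 (within tolerance)
marg <- rowSums(w)
while (max(abs(marg - 1)) > tol) {
w <- w/marg
marg <- colSums(w)
w <- eachrow(w, marg, "/")
marg <- rowSums(w)
}
chunk$trait <- colSums(w*chunk$trait)
chunk
}
# split into non-overlapping segments of at most 1000 rows and stitch back together
segments <- split(df, ceiling(seq_len(nrow(df)) / 1000))
df_segmented <- do.call(rbind, lapply(segments, weighted_cluster_avg))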
UPDATE: If an exclusive grouping method is really needed, this implements a greedy algorithm:
avg_points <- numeric(nrow(df))
clusters <- vector("list", nrow(df))
currclust <- 0L
df$unused <- TRUE
for (cl in seq_along(i)) { # iterate over the neighborhood of each point
if (sum(df$unused[i[[cl]]])) {
currclust <- currclust + 1L
avg_points[currclust] <- mean(df$trait[i[[cl]]][df$unused[i[[cl]]]])
clusters[[currclust]] <- i[[cl]][df$unused[i[[cl]]]]
df$unused[i[[cl]]] <- FALSE
}
}
avg_points <- avg_points[1:currclust]
clusters <- clusters[1:currclust]
avg_points
#> [1] 91.34714 91.65000 91.40000
clusters
#> [[1]]
#> [1] 1 2 5 6 7 8 9
#>
#> [[2]]
#> [1] 10 11
#>
#> [[3]]
#> [1] 3 4
Note that the issue of uneven weightings is still present--the observations in group 1 are each weighted 1/7, while the observations in groups 2 and 3 are each weighted 1/2.
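As a quick check of that statement, the implied weight of each observation is one over the size of its cluster, which can be read straight off the clusters object from above:
# weight of each observation within its cluster (1 / cluster size)
sapply(clusters, function(cl) 1/length(cl))
# i.e. 1/7 for the seven observations in cluster 1, and 1/2 for clusters 2 and 3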
Is there a fast way to convert latitude and longitude coordinates to state codes in R? I've been using the zipcode package as a lookup table, but it's too slow when I'm querying lots of lat/long values.
If not in R, is there any way to do this using the Google geocoder or any other type of fast querying service?
Thanks!
Here are two options, one using sf and one using sp package functions. sf is the more modern (and, here in 2020, recommended) package for analyzing spatial data, but in case it's still useful, I am leaving my original 2012 answer showing how to do this with sp-related functions.
Method 1 (using sf):
library(sf)
library(spData)
## pointsDF: A data.frame whose first column contains longitudes and
## whose second column contains latitudes.
##
## states: An sf MULTIPOLYGON object with 50 states plus DC.
##
## name_col: Name of a column in `states` that supplies the states'
## names.
lonlat_to_state <- function(pointsDF,
states = spData::us_states,
name_col = "NAME") {
## Convert points data.frame to an sf POINTS object
pts <- st_as_sf(pointsDF, coords = 1:2, crs = 4326)
## Transform spatial data to some planar coordinate system
## (e.g. Web Mercator) as required for geometric operations
states <- st_transform(states, crs = 3857)
pts <- st_transform(pts, crs = 3857)
## Find names of state (if any) intersected by each point
state_names <- states[[name_col]]
ii <- as.integer(st_intersects(pts, states))
state_names[ii]
}
## Test the function with points in Wisconsin, Oregon, and France
testPoints <- data.frame(x = c(-90, -120, 0), y = c(44, 44, 44))
lonlat_to_state(testPoints)
## [1] "Wisconsin" "Oregon" NA
If you need higher resolution state boundaries, read in your own vector data as an sf object using sf::st_read() or by some other means. One nice option is to install the rnaturalearth package and use it to load a state vector layer from rnaturalearthhires. Then use the lonlat_to_state() function we just defined as shown here:
library(rnaturalearth)
us_states_ne <- ne_states(country = "United States of America",
returnclass = "sf")
lonlat_to_state(testPoints, states = us_states_ne, name_col = "name")
## [1] "Wisconsin" "Oregon" NA
For very accurate results, you can download a geopackage containing GADM-maintained administrative borders for the United States from this page. Then, load the state boundary data and use them like this:
USA_gadm <- st_read(dsn = "gadm36_USA.gpkg", layer = "gadm36_USA_1")
lonlat_to_state(testPoints, states = USA_gadm, name_col = "NAME_1")
## [1] "Wisconsin" "Oregon" NA
Method 2 (using sp):
Here is a function that takes a data.frame of lat-longs within the lower 48 states, and for each point, returns the state in which it is located.
Most of the function simply prepares the SpatialPoints and SpatialPolygons objects needed by the over() function in the sp package, which does the real heavy lifting of calculating the 'intersection' of points and polygons:
library(sp)
library(maps)
library(maptools)
# The single argument to this function, pointsDF, is a data.frame in which:
# - column 1 contains the longitude in degrees (negative in the US)
# - column 2 contains the latitude in degrees
lonlat_to_state_sp <- function(pointsDF) {
# Prepare SpatialPolygons object with one SpatialPolygon
# per state (plus DC, minus HI & AK)
states <- map('state', fill=TRUE, col="transparent", plot=FALSE)
IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
states_sp <- map2SpatialPolygons(states, IDs=IDs,
proj4string=CRS("+proj=longlat +datum=WGS84"))
# Convert pointsDF to a SpatialPoints object
pointsSP <- SpatialPoints(pointsDF,
proj4string=CRS("+proj=longlat +datum=WGS84"))
# Use 'over' to get _indices_ of the Polygons object containing each point
indices <- over(pointsSP, states_sp)
# Return the state names of the Polygons object containing each point
stateNames <- sapply(states_sp@polygons, function(x) x@ID)
stateNames[indices]
}
# Test the function using points in Wisconsin and Oregon.
testPoints <- data.frame(x = c(-90, -120), y = c(44, 44))
lonlat_to_state_sp(testPoints)
[1] "wisconsin" "oregon" # IT WORKS
You can do it in a few lines of R.
library(sp)
library(rgdal)
#lat and long
Lat <- 57.25
Lon <- -9.41
#make a data frame
coords <- as.data.frame(cbind(Lon,Lat))
#and into Spatial
points <- SpatialPoints(coords)
#SpatialPolygonsDataFrame - I'm using a shapefile of UK counties
counties <- readOGR(".", "uk_counties")
#assume same proj as shapefile!
proj4string(points) <- proj4string(counties)
#get county polygon point is in
result <- as.character(over(points, counties)$County_Name)
See ?over in the sp package.
You'll need to have the state boundaries as a SpatialPolygonsDataFrame.
Example data (polygons and points)
library(raster)
pols <- shapefile(system.file("external/lux.shp", package="raster"))
xy <- coordinates(pols)
Use raster::extract
extract(pols, xy)
# point.ID poly.ID ID_1 NAME_1 ID_2 NAME_2 AREA
#1 1 1 1 Diekirch 1 Clervaux 312
#2 2 2 1 Diekirch 2 Diekirch 218
#3 3 3 1 Diekirch 3 Redange 259
#4 4 4 1 Diekirch 4 Vianden 76
#5 5 5 1 Diekirch 5 Wiltz 263
#6 6 6 2 Grevenmacher 6 Echternach 188
#7 7 7 2 Grevenmacher 7 Remich 129
#8 8 8 2 Grevenmacher 12 Grevenmacher 210
#9 9 9 3 Luxembourg 8 Capellen 185
#10 10 10 3 Luxembourg 9 Esch-sur-Alzette 251
#11 11 11 3 Luxembourg 10 Luxembourg 237
#12 12 12 3 Luxembourg 11 Mersch 233
It's very straightforward using sf:
library(maps)
library(sf)
## Get the states map, turn into sf object
US <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
## Test the function using points in Wisconsin and Oregon
testPoints <- data.frame(x = c(-90, -120), y = c(44, 44))
# Make it a spatial dataframe, using the same coordinate system as the US spatial dataframe
testPoints <- st_as_sf(testPoints, coords = c("x", "y"), crs = st_crs(US))
#.. and perform a spatial join!
st_join(testPoints, US)
ID geometry
1 wisconsin POINT (-90 44)
2 oregon POINT (-120 44)
The county shapefile is only available as a national shapefile (2017 TIGER/Line® Shapefiles: Counties (and equivalent), https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2017&layergroup=Counties+%28and+equivalent%29).
I want to select just one state (e.g. Pennsylvania). So,
# read county polygons
counties <- readOGR(dsn="tl_2017_us_county", layer="tl_2017_us_county")
# subset to PA counties
PA_counties <- subset(counties, counties@data$STATEFP == "42")
HOWEVER, when I try and create a data frame and map, I'm getting the error:
Error in FUN(X[[i]], ...) : object 'lon' not found
# create a data frame
PA_counties.df <- as.data.frame(PA_counties)
PA_counties.dfFORT <-fortify(PA_counties.df, region = "GEOID")
gg<-ggplot()
gg <- gg + geom_polygon(data = PA_counties.dfFORT, aes(x=lon, y=lat, group=group,
fill=NA), color = "blue", fill=NA, size = 0.5)
gg <- gg +coord_map()
gg
Help? I'm hoping to create this map; merge the data with another file by GeoID, and fill some of the counties (e.g. for GeoID xxx if =1 then fill with blue, etc).
This must be a very common mapping use case scenario? Grateful for any tips?
Best,
Lori
I did some troubleshooting that I encourage you to do as well. First, look at the names of your fortified data frame: you have columns such as INTPLAT, INTPLON, and GROUP, instead of lat, long, and group.
When you call as.data.frame on the spatial data frame and then call fortify, you're not getting what I think you would expect. If you take a closer look at the output you get from these two functions, it seems to be centroids or some other point, such that you have just one point per county, and the coordinates are factors, not numbers. You instead need to call fortify on the spatial data frame itself. You should expect a data frame with thousands of rows, because there are many points needed to make up the polygon shape of each county.
Note that I used tigris::counties to get the shapefile because I couldn't read a downloaded file to make the reprex, but I'm pretty sure the shapefile is identical.
library(tidyverse)
library(sf)
library(rgdal)
counties <- tigris::counties(cb = T)
# counties <- readOGR(dsn="tl_2017_us_county", layer="tl_2017_us_county")
PA_counties <- subset(counties, counties@data$STATEFP == "42")
PA_counties.dfFORT <- fortify(PA_counties, region = "GEOID")
names(PA_counties.dfFORT)
#> [1] "long" "lat" "order" "hole" "piece" "id" "group"
Then you can use geom_polygon as you expected:
ggplot(PA_counties.dfFORT, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = NA, color = "blue") +
coord_map()
An easier and more flexible way is to use sf. Again, you can use sf::read_sf on the shapefile; for making a reprex, I called sf::st_as_sf on the spatial data frame I'd gotten with tigris. sf lets you use dplyr-style functions to do operations like filtering, adding columns, and calculating summaries.
counties_sf <- st_as_sf(counties)
# counties_sf <- read_sf("tl_2017_us_county")
pa_counties_sf <- counties_sf %>%
filter(STATEFP == "42")
head(pa_counties_sf)
#> Simple feature collection with 6 features and 9 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -80.36087 ymin: 39.72002 xmax: -74.7215 ymax: 40.74368
#> epsg (SRID): 4269
#> proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
#> STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
#> 1 42 003 01213657 0500000US42003 42003 Allegheny 06
#> 2 42 017 01209173 0500000US42017 42017 Bucks 06
#> 3 42 041 01209176 0500000US42041 42041 Cumberland 06
#> 4 42 055 01213670 0500000US42055 42055 Franklin 06
#> 5 42 061 01213672 0500000US42061 42061 Huntingdon 06
#> 6 42 071 01209181 0500000US42071 42071 Lancaster 06
#> ALAND AWATER geometry
#> 1 1890889706 37411488 MULTIPOLYGON (((-80.36078 4...
#> 2 1565408723 45201371 MULTIPOLYGON (((-75.48406 4...
#> 3 1412834155 12521844 MULTIPOLYGON (((-77.62503 4...
#> 4 2000052118 1544300 MULTIPOLYGON (((-78.09914 3...
#> 5 2265337403 37851955 MULTIPOLYGON (((-78.2567 40...
#> 6 2444606396 103423673 MULTIPOLYGON (((-76.72162 4...
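For example, adding a column and computing a summary work the same dplyr-style way (a small sketch; the area calculation is just an illustration, not part of the original workflow):
pa_counties_sf %>%
mutate(area_km2 = as.numeric(st_area(geometry)) / 1e6) %>% # add a column
summarise(n_counties = n(), # plain dplyr summary...
total_area_km2 = sum(area_km2)) # ...which also unions the county geometries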
sf is also very easy to use with ggplot because as of ggplot2 version 3.0.0 released a few weeks ago, ggplot ships with a geom_sf function that reads your sf object's geometry column automatically. You can go straight from reading a shapefile to plotting it.
ggplot(pa_counties_sf) +
geom_sf(fill = NA, color = "blue")
As an example of how you can then merge in another dataset with your shape, I made some dummy data with a letter for each county, then just called dplyr::left_join to join it with the sf data frame, then plotted. Hope that helps get you started!
pa_data <- data_frame(
GEOID = pa_counties_sf$GEOID
) %>%
mutate(type = sample(c("A", "B"), size = nrow(.), replace = T))
pa_counties_sf %>%
left_join(pa_data, by = "GEOID") %>%
ggplot() +
geom_sf(aes(fill = type), color = "blue")
Created on 2018-07-15 by the reprex package (v0.2.0).
The sf package and the geom_sf function from ggplot2 make this task quite simple.
library(sf)
library(ggplot2)
counties <- st_read("tl_2017_us_county")
# subset to PA counties
# note with sf you can treat it just like a data.frame and there is no need to call #data
PA_counties <- counties[which(counties$STATEFP == "42"),]
# use geom_sf to plot an sf object easily
ggplot() + geom_sf(data = PA_counties)
I used the feedback above and a couple of other sites for reference (I also switched to census tracts). Two key things that weren't obvious to a newbie: 1) it helps to switch the GeoID to character, and 2) fortify doesn't work properly if the names aren't "long", "lat", "group", etc.
# Percent reporting no participation in leisure time physical activity
# https://chronicdata.cdc.gov/500-Cities/500-Cities-Census-Tract-level-Data-GIS-Friendly-Fo/k86t-wghb/data
# New York State census tracts
# https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2018&layergroup=Census+Tracts
library(rgdal)
library(ggplot2)
library(dplyr)
library(maptools)
census <- readOGR(dsn="tl_2018_36_tract", layer="tl_2018_36_tract")
nyc_census <- subset(census, census@data$COUNTYFP %in% c("005", "061", "047", "081","085"))
nyc_census@data$GEOID <- as.character(nyc_census@data$GEOID)
study <- fortify(nyc_census, region = "GEOID")
lpa <- read.csv(file ="500Cities.csv", header=TRUE, sep=",")
lpa <- lpa[, c(3,45,46)]
lpa <-mutate(lpa, id=as.character(TractFIPS),
LPA_CrudePrev =as.numeric(LPA_CrudePrev),
lpa_percent =(LPA_CrudePrev/100))
study2 <- left_join(study, lpa, by=c("id"))
ggplot() +
geom_polygon(data =study2, aes(x=long, y=lat, group = group, fill=lpa_percent), color="grey50") +
scale_fill_gradientn(colours = c("red", "white", "cadetblue"),
values = c(1,0.5, .3, .2, .1, 0)) +
coord_map(xlim = c(-74.26, -73.71), ylim = c(40.49,40.92))