Why is the area provided by tidycensus different from that calculated by sf::st_area()? - r

I am using the tidycensus R package to pull in census data and geometries. I want to be able to calculate population densities and have the results match what I see on censusreporter.org. I am noticing a difference between the geography variables returned by tidycensus and the area I calculate myself using the sf package's sf::st_area() function.
library(tidyverse)
library(tidycensus)
census_api_key("my_api_key")
library(sf)
options(tigris_use_cache = TRUE)

pop_texas <-
  get_acs(geography = 'state',
          variables = "B01003_001", # Total Population
          year = 2020,
          survey = 'acs5',
          keep_geo_vars = TRUE,
          geometry = TRUE) %>%
  filter(GEOID == '48') # Filter to Texas
Since I set keep_geo_vars = TRUE, the result includes an ALAND column, which I believe is the correct area for the returned geography, in square meters (m^2).
> pop_texas$ALAND %>% format(big.mark=",")
[1] "676,680,588,914"
# Conversion to square miles
> (pop_texas$ALAND / 1000000 / 2.5899881) %>% format(big.mark=",")
[1] "261,267.8"
When I convert the ALAND amount to square miles I get the same number as shown on censusreporter.org:
I have also tried to calculate the area using the sf::st_area() function, but I get a different result:
> sf::st_area(pop_texas) %>% format(big.mark=",", scientific=FALSE)
[1] "688,276,954,146 [m^2]"
# Conversion to square miles
> (sf::st_area(pop_texas) / 1000000 / 2.5899881) %>%
+ as.numeric() %>%
+ format(big.mark=",", scientific=FALSE)
[1] "265,745.2"
Please let me know if there is something I am missing to reconcile these numbers. I would expect to get the same results either directly through tidycensus or calculating the area using sf::st_area().
Right now I am off by a lot:
> (pop_texas$ALAND - as.numeric(st_area(pop_texas)) ) %>%
+ format(big.mark=",")
[1] "-11,596,365,232"

If you want the "official" area of a shape like Texas you should always use the ALAND or published area value. st_area() computes the area from the polygon geometry, which is always going to be a simplified and imperfect representation of Texas (or any other area). For smaller shapes (like Census tracts) the calculations will probably be pretty close; for larger shapes like states (especially those with complex coastal geography, like Texas) you're going to be further off.
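For the original goal of matching the censusreporter.org density figure, a minimal sketch (assuming the `estimate` column holds the total-population value returned by get_acs) is to build the density directly from ALAND and let the units package handle the conversion:

library(units)

# ALAND is published in square meters; convert to square miles with units
# rather than dividing by hand-typed constants
aland_mi2 <- set_units(set_units(pop_texas$ALAND, m^2), mi^2)

# People per square mile, using the ACS total-population estimate
pop_texas$estimate / aland_mi2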

These differences are usually due to the CRS (the projection used on your sf objects). Some projections distort area, others distort shape. See http://wiki.gis.com/wiki/index.php/Distortion to learn more.
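If you do want to control the projection effect when measuring with st_area(), a minimal sketch (an illustration, not part of the original answer) is to reproject to an equal-area CRS first; EPSG:5070 (NAD83 / Conus Albers) is one common choice for the continental US:

library(sf)
library(units)

# Measure area after projecting to an equal-area CRS, then convert units
pop_texas %>%
  st_transform(5070) %>%   # NAD83 / Conus Albers (equal-area)
  st_area() %>%
  set_units(mi^2)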


Identifying points located near a polygon's boundary

I am trying to identify all points (postcodes in my case) that are located near the coastline of the UK (i.e., a polygon). I am using R to process this.
I downloaded the geographical outline of the United Kingdom from here as a shapefile. A list of all postcodes for the UK was accessed from the ONS here. Please note that the latter file is very large (211MB zipped).
To begin, I load both files into R and convert them to the same coordinate reference system (OSGB 1936; EPSG:27700). For the polygon of the UK, I convert it to lines that represent the boundary/coastline (note that while Northern Ireland shares a land border with Ireland, I will subset out any postcodes erroneously matched as near the coastline by lat/long later). I then convert the postcodes into spatial points.
# Load libraries
library(sf)
library(data.table)
# Load data
uk_shp <- read_sf("./GBR_adm/GBR_adm0.shp") # Load UK shapefile (ignore the download file says GBR, it is UK)
uk_shp <- st_transform(uk_shp, crs = 27700) # Convert to a co-ordinate reference system (CRS) that allows buffers in correct units later (note: 4326 is the world CRS)
uk_coast <- st_cast(uk_shp,"MULTILINESTRING") # Convert polygon to a line (i.e., coastline)
# Load in postcodes
pcd <- fread("./ONSPD_FEB_2022_UK/Data/ONSPD_FEB_2022_UK.csv") # Load all postcodes for Great Britain - this is a very large file
pcd <- pcd[, c(1:3, 43:44)] # Drop unnecessary information/columns to save memory
# Convert to spatial points data frame
pcd_sp <- pcd %>% # For object of postcodes
st_as_sf(coords = c("long", "lat")) %>% # Define as spatial object and identify which columns tell us the position of points
st_set_crs(27700) # Set CRS
I originally thought the most efficient approach would be to define what a coastal region is (here, within 5km of the coastline), create a buffer to represent that around the coastline, and then use a point-in-polygon function to select all points within the buffers. However, the code below had not finished running overnight, which probably suggests that it was the wrong approach, and I am unsure why it is taking so long.
uk_buf <- st_buffer(uk_coast, 5000) # Create 5km buffer around the coastline
pcd_coastal <- st_intersection(uk_buf, pcd_sp) # Point-in-polygon (i.e., keep only the postcodes that are located in the buffer region)
So I changed my approach to calculating the straight-line distance from each point to the nearest coastline. However, the code below gives incorrect distances. For example, I select one postcode (AB12 4XP) that is located ~2.6km from the coastline, yet the code reports ~82km, which is very wrong. I also tried st_nearest_feature() but could not get it to work (it may well do the job, but it was beyond my attempts).
test <- pcd_sp[pcd_sp$pcd == "AB124XP",] # Subset test postcode
dist <- st_distance(test, uk_coast, by_element = TRUE, which = "Euclidean") # Calculate distance
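As an aside on the st_nearest_feature() attempt mentioned above, the usual pattern (a sketch only, assuming pcd_sp and uk_coast share the same CRS; this is not from the original post) is to pair it with st_distance(..., by_element = TRUE):

# For each point, find the index of the nearest coastline feature,
# then compute one distance per point against that matched feature
idx <- st_nearest_feature(pcd_sp, uk_coast)
dist_to_coast <- st_distance(pcd_sp, uk_coast[idx, ], by_element = TRUE)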
I am unsure how to proceed from here - I don't think it is the wrong CRS. It might be that the multilinestring conversion is causing problems. Does anyone have suggestions on what to do?
sf has an st_is_within_distance function that can test if points are within a distance of a line. My test data is 10,000 random points in the bounding box of the UK shape, and the UK shape in OSGB grid coordinates.
> system.time({indist = st_is_within_distance(uk_coast, pts, dist=5000)})
user system elapsed
30.907 0.003 30.928
But this isn't building a spatial index. The docs say that it does build a spatial index if the coordinates are "geographic" and the flag for using spherical geometry is set. I don't understand why it can't build one for Cartesian coordinates, but let's see how much faster it is...
Transform takes no time at all:
> ukLL = st_transform(uk_coast, 4326)
> ptsLL = st_transform(pts, 4326)
Then test...
system.time({indistLL = st_is_within_distance(ukLL, ptsLL, dist=5000)})
user system elapsed
1.405 0.000 1.404
Just over a second. Any difference between the two? Let's see:
> setdiff(indistLL[[1]], indist[[1]])
[1] 3123
> setdiff(indist[[1]], indistLL[[1]])
integer(0)
So point 3123 is in the set using lat-long, but not the set using OSGB. There's nothing in OSGB that isn't in the lat-long set.
Quick plot to show the selected points:
> plot(uk_coast$geometry)
> plot(pts$geometry[indistLL[[1]]], add=TRUE)

Is there a way to get around this problem in my Road network analysis in R?

So I am having issues with a road network analysis I am conducting, which is aimed at looking at how far each park entrance is from each postcode centroid in the London borough of Southwark. However, I have experienced errors in my analysis which have made all of the distances 0, which clearly shouldn't be the case. While I can't pinpoint exactly where the issue lies, one message has shown up that hasn't appeared in past network analyses I ran successfully. I will show the code I ran and the messages, along with the packages I used to extract the data.
#packages loaded
library(here)
library(magrittr)
library(osmdata)
library(dodgr)
library(sf)
library(expss)
library(tmap)
# Define our bbox coordinates, here our coordinates relate to Southwark
southwark_bbox <- c(-0.121942,51.41078,-0.023347,51.508313)
# Pass our bounding box coordinates into the OverPassQuery (opq) function
osmdata <- opq(bbox = southwark_bbox) %>%
  # Pipe this into the add_osm_feature data query function to extract our highways
  # Note here, we specify the values we are interested in, omitting motorways
  add_osm_feature(key = "highway",
                  value = c("primary", "secondary", "tertiary", "residential",
                            "path", "footway", "unclassified", "living_street",
                            "pedestrian")) %>%
  # And then pipe this into our osmdata_sf object
  osmdata_sf()
# Extract our spatial data into variables of their own
# Extract the points, with their osm_id.
southwark_roads_nodes <- osmdata$osm_points[, "osm_id"]
# Extract the lines, with their osm_id, name, type of highway, max speed and
# oneway attributes
southwark_roads_edges <- osmdata$osm_lines[, c("osm_id", "name", "highway", "maxspeed",
"oneway")]
#plotted just to see if Southwark was extracted properly, which it was
plot(southwark_roads_edges)
# Create network graph using our edge data, with the foot weighting profile
southwark_graph <- weight_streetnet(southwark_roads_edges)
It is with the last line of code that I got a message saying "The following highway types are present in data yet lack corresponding weight_profile values: NA,". I am not entirely sure what this means or why it appeared. It wasn't an error message, as the object was still created, but later on in my analysis it led to every measure reading 0m, which is clearly not true. If anyone can guide me in the right direction, please do let me know.
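A minimal diagnostic sketch (an assumption about the cause, not something stated in the original post): the message usually means some OSM ways came back with an NA highway value, which has no entry in the weighting profile. One option is to inspect and drop those edges, and name the foot profile explicitly, before building the graph:

# How many edges have an NA highway tag?
table(southwark_roads_edges$highway, useNA = "ifany")

# Drop untagged edges and build the graph with the foot weighting profile
southwark_roads_edges <- southwark_roads_edges[!is.na(southwark_roads_edges$highway), ]
southwark_graph <- weight_streetnet(southwark_roads_edges, wt_profile = "foot")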

How to determine if a point lies within an sf geometry that spans the dateline?

Using the R package sf, I'm trying to determine whether some points occur within the bounds of a shapefile (in this case, Hawai‘i's EEZ). The shapefile in question can be found here. Unfortunately, the boundaries of the area in question span +/-180 longitude, which I think is what's messing me up. (I read on the sf website some business about spherical geometry in the new version, but I haven't been able to get that version to install. I think the polygons I'm dealing with are sufficiently "flat" to avoid any of those issues anyway.) Part of the issue seems to be that my shapefile contains multiple geometries broken up by the dateline, but I'm not sure how to combine them.
How do you tell, using sf, whether some points are inside of the bounds of some object in a shapefile (that happens to span the dateline)?
I have tried various combinations of st_shift_longitude to no avail. I have also tried transforming to what I think is a planar projection (2163), and that didn't work.
Here's how I'm currently trying to do this:
library(sf)
library(maps)
library(ggplot2)
library(tidyverse)
# this is the shapefile from the link above
eez_unshifted <- read_sf("USMaritimeLimitsAndBoundariesSHP/USMaritimeLimitsNBoundaries.shp") %>%
filter(OBJECTID == 1206) %>%
st_transform(4326)
eez_shifted <- read_sf("USMaritimeLimitsAndBoundariesSHP/USMaritimeLimitsNBoundaries.shp") %>%
filter(OBJECTID == 1206) %>%
st_transform(4326) %>%
st_shift_longitude()
# four points, in and out of the geometry, on either side of the dateline
pnts <- tibble(x=c(-171.952474,176.251978,179.006220,-167.922929),y=c(25.561970,17.442716,28.463375,15.991429)) %>%
st_as_sf(coords=c('x','y'),crs=st_crs(eez_unshifted))
# these all return false for every point
st_within(pnts,eez_unshifted)
st_within(st_shift_longitude(pnts),eez_unshifted)
st_within(pnts,eez_shifted)
st_within(st_shift_longitude(pnts),eez_shifted)
# these also all return false for every point
st_intersects(pnts,eez_unshifted)
st_intersects(st_shift_longitude(pnts),eez_unshifted)
st_intersects(pnts,eez_shifted)
st_intersects(st_shift_longitude(pnts),eez_shifted)
# plot the data just to show that it looks right
wrld2 <- st_as_sf(maps::map('world2', plot=F, fill=T))
ggplot() +
geom_sf(data=wrld2, fill='gray20',color="lightgrey",size=0.07) +
geom_sf(data=eez_shifted) +
geom_sf(data=st_shift_longitude(pnts)) +
coord_sf(xlim=c(100,290), ylim=c(-60,60)) +
xlab("Longitude") +
ylab("Latitude")
The answer is to make sure the geometry you're checking against is a polygon:
> eez_poly <- st_polygonize(eez_shifted)
> st_within(pnts,eez_poly)
although coordinates are longitude/latitude, st_within assumes that they are planar
Sparse geometry binary predicate list of length 4, where the predicate was `within'
1: 1
2: (empty)
3: 1
4: (empty)

Apportioning data from one geography to another

We have two geographies: census tracts and a squared grid. The grid dataset only has information on population count. We have information on the total income of each census tract. What we would like to do is to apportion these income data from the census tracts to the grid cells.
This is a very common problem in geographical analysis and there are probably many ways to address it. We want to do this considering not only the spatial overlap between census tracts and grid cells but also the population of each cell. This is mainly to avoid problems when a large census tract contains people living only in a small part of it.
We present below a reproducible example (using R and the sf package) and the solution we've found to this problem so far, using a sample extracted from our geographies. We would appreciate seeing whether others have alternative (more efficient) solutions, and checking whether our results are correct.
library(sf)
library(dplyr)
library(readr)
# Files
download.file("https://github.com/ipeaGIT/acesso_oport/raw/master/test/shapes.RData", "shapes.RData")
load("shapes.RData")
# Open tracts and calculate area
tract <- tract %>%
mutate(area_tract = st_area(.))
# Open grid squares and calculate area
square <- square %>%
mutate(area_square = st_area(.))
ui <-
# Create spatial units for all intersections between the tracts and the squares (we're calling these "piece")
st_intersection(square, tract) %>%
# Calculate area for each piece
mutate(area_piece = st_area(.)) %>%
# Compute the proportion of each tract that's inserted in that piece
mutate(area_prop_tract = area_piece/area_tract) %>%
# Compute the proportion of each square that's inserted in that piece
mutate(area_prop_square = area_piece/area_square) %>%
# Based on the square's population, compute the population that lives in that piece
mutate(pop_prop_square = square_pop * area_prop_square) %>%
# Compute the population proportion of each square that is within the tract
group_by(id_tract) %>%
mutate(sum = sum(pop_prop_square)) %>%
ungroup() %>%
# Compute the population of each piece within the tract
mutate(pop_prop_square_in_tract = pop_prop_square/sum) %>%
# Compute income within each piece
mutate(income_piece = tract_incm * pop_prop_square_in_tract)
# Final aggregation by squares
ui_fim <- ui %>%
# Group by squares and population and sum the income for each piece
group_by(id_square, square_pop) %>%
summarise(square_income = sum(income_piece, na.rm = TRUE))
Thank you!
Depending on the approach to interpolation you want to use, I may have a solution for you that I've helped develop. The areal package implements areal weighted interpolation, and I use it in my own research for interpolating between U.S. census geography and grid squares. You can check out the package's website (and associated vignettes) here. Hope this is useful!
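For reference, a minimal sketch of what that could look like with the objects from the question (the column names id_square, id_tract and tract_incm are taken from the question's code; note that plain areal weighting spreads income by area overlap only, without the population weighting the question asks for):

library(areal)

# Interpolate tract income (an extensive variable) onto the grid squares
square_income <- aw_interpolate(
  square,                   # target geometry: grid cells
  tid = id_square,          # target id column
  source = tract,           # source geometry: census tracts
  sid = id_tract,           # source id column
  weight = "sum",
  output = "sf",
  extensive = "tract_incm"  # variable(s) to interpolate
)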

Filter shapefile polygons by area

I have the following boundary dataset for the United Kingdom, which shows all the counties:
library(raster)
library(sp)
library(ggplot2)
# Download the data
GB <- getData('GADM', country="gbr", level=2)
Using the subset function it is really easy to filter the shapefile polygons by an attribute in the data. For example, if I want to exclude Northern Ireland:
GB_sub <- subset(GB, NAME_1 != "Northern Ireland")
However, there are lots of small islands which distort the scale data range, as shown in the maps below:
Any thoughts on how to elegantly subset the dataset on a minimum size? It would be ideal to have something in the format consistent with the subset argument. For example:
GB_sub <- subset(GB, Area > 20) # specify minimum area in km^2
Here is another potential solution. Because your data is in a lat-long projection, directly calculating the area from the latitude and longitude coordinates would be biased; it is better to calculate the area using functions from the geosphere package.
install.packages("geosphere")
library(geosphere)
# Calculate the area
GB$poly_area <- areaPolygon(GB) / 10^6
# Filter GB based on area > 20 km2
GB_filter <- subset(GB, poly_area > 20)
poly_area contains the area in km2 for all polygons. We can filter the polygons by a threshold, such as 20 in your example. GB_filter is the final output.
This is one potential solution:
GB_sub = GB[sapply(GB@polygons, function(x) x@area > 0.04),] # select min size
map.df <- fortify(GB_sub)
ggplot(map.df, aes(x=long, y=lat, group=group)) + geom_polygon()
Check this link for specifics on the actual interpretation of km2 size: Getting a slot's value of S4 objects?
I compared both as well but they don't seem to differ:
out1 = sapply(GB@polygons, function(x) x@area)
out2 = rgeos::gArea(GB, byid=TRUE)
