Apportioning data from one geography to another

We have two geographies: census tracts and a square grid. The grid dataset only has a population count for each cell, while we have the total income of each census tract. What we would like to do is apportion the income data from the census tracts to the grid cells.
This is a very common problem in geographical analysis and there are probably many ways to address it. We want to take into account not only the spatial overlap between census tracts and grid cells but also the population of each cell, mainly to avoid problems when a large census tract contains people living in only a small part of its area.
Below is a reproducible example (using R and the sf package) and the solution we have found so far, using a sample extracted from our geographies. We would appreciate alternative (more efficient) solutions and any checks on whether our results are correct.
library(sf)
library(dplyr)
library(readr)

# Files
download.file("https://github.com/ipeaGIT/acesso_oport/raw/master/test/shapes.RData", "shapes.RData")
load("shapes.RData")

# Open tracts and calculate their areas
tract <- tract %>%
  mutate(area_tract = st_area(.))

# Open grid squares and calculate their areas
square <- square %>%
  mutate(area_square = st_area(.))

ui <-
  # Create spatial units for all intersections between the tracts and the squares (we call these "pieces")
  st_intersection(square, tract) %>%
  # Calculate the area of each piece
  mutate(area_piece = st_area(.)) %>%
  # Compute the proportion of each tract that falls in that piece
  mutate(area_prop_tract = area_piece / area_tract) %>%
  # Compute the proportion of each square that falls in that piece
  mutate(area_prop_square = area_piece / area_square) %>%
  # Based on the square's population, compute the population living in that piece
  mutate(pop_prop_square = square_pop * area_prop_square) %>%
  # Compute the total piece population within each tract
  group_by(id_tract) %>%
  mutate(sum = sum(pop_prop_square)) %>%
  ungroup() %>%
  # Compute each piece's share of its tract's population
  mutate(pop_prop_square_in_tract = pop_prop_square / sum) %>%
  # Compute the income within each piece
  mutate(income_piece = tract_incm * pop_prop_square_in_tract)

# Final aggregation by squares
ui_fim <- ui %>%
  # Group by square and its population, and sum the income of the pieces
  group_by(id_square, square_pop) %>%
  summarise(square_income = sum(income_piece, na.rm = TRUE))
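As a quick sanity check on the pipeline above (a sketch assuming the column names in our sample data), the apportionment should roughly preserve total income, since each tract's income is split across its pieces in proportion to their population shares:

sum(tract$tract_incm, na.rm = TRUE)
# should be approximately equal to
sum(ui_fim$square_income, na.rm = TRUE)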
Thank you!

Depending on the approach to interpolation you want to use, I may have a solution for you that I've helped develop. The areal package implements areal weighted interpolation, and I use it in my own research for interpolating between U.S. census geographies and grid squares. You can check out the package's website (and associated vignettes) here. Hope this is useful!
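For reference, a minimal sketch of what that could look like with the objects from the question above (assuming the tract, square, id_tract, id_square and tract_incm names used there; note that plain areal weighting does not include the population adjustment described in the question, and areal expects both layers in a projected CRS):

library(areal)
library(sf)

# Interpolate tract income onto the grid squares using area weights;
# income is treated as a spatially extensive (count-like) variable
square_income <- aw_interpolate(square,
                                tid = id_square,
                                source = tract,
                                sid = "id_tract",
                                weight = "sum",
                                output = "sf",
                                extensive = "tract_incm")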

Related

Why is the area provided by tidycensus different than the one calculated by sf::st_area()?

I am using the tidycensus R package to pull in census data and geometries. I want to be able to calculate population densities and have the results match what I see on censusreporter.org. I am noticing a difference between the geography variables returned by tidycensus and what I calculate myself using the sf package's sf::st_area() function.
library(tidyverse)
library(tidycensus)
census_api_key("my_api_key")
library(sf)

options(tigris_use_cache = TRUE)

pop_texas <-
  get_acs(geography = 'state',
          variables = "B01003_001",  # Total Population
          year = 2020,
          survey = 'acs5',
          keep_geo_vars = TRUE,
          geometry = TRUE) %>%
  filter(GEOID == '48')  # Filter to Texas
Since I set the keep_geo_vars argument to TRUE, the result includes an ALAND column, which I believe is the correct area for the returned geography in square meters (m^2).
> pop_texas$ALAND %>% format(big.mark=",")
[1] "676,680,588,914"
# Conversion to square miles
> (pop_texas$ALAND / 1000000 / 2.5899881) %>% format(big.mark=",")
[1] "261,267.8"
When I convert the ALAND amount to square miles, I get the same number as shown on censusreporter.org.
I have also tried to calculate the area using the sf::st_area() function, but I get a different result:
> sf::st_area(pop_texas) %>% format(big.mark=",", scientific=FALSE)
[1] "688,276,954,146 [m^2]"
# Conversion to square miles
> (sf::st_area(pop_texas) / 1000000 / 2.5899881) %>%
+ as.numeric() %>%
+ format(big.mark=",", scientific=FALSE)
[1] "265,745.2"
Please let me know if there is something I am missing to reconcile these numbers. I would expect to get the same results either directly through tidycensus or calculating the area using sf::st_area().
Right now I am off by a lot:
> (pop_texas$ALAND - as.numeric(st_area(pop_texas)) ) %>%
+ format(big.mark=",")
[1] "-11,596,365,232"
If you want the "official" area of a shape like Texas you should always use the ALAND (or other published) area value. st_area() computes the area of the polygon geometry, which is always going to be a simplified and imperfect representation of Texas (or any other region). For smaller shapes (like census tracts) the calculations will probably be pretty close; for larger shapes like states (especially those with complex coastal geography, like Texas) you're going to be further off.
These differences usually come down to the CRS (the projection used by your sf objects). Some projections distort area, others distort shape. See this page to learn more: http://wiki.gis.com/wiki/index.php/Distortion#:~:text=There%20are%20four%20main%20types,%2C%20direction%2C%20shape%20and%20area.
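To see how much the computation method itself matters (a sketch using the pop_texas object from the question; EPSG:5070, NAD83 / Conus Albers, is just one reasonable equal-area choice):

library(sf)

# Geodesic area on the sphere (sf's s2 engine, the default for lon/lat data)
sf_use_s2(TRUE)
st_area(pop_texas)

# Planar area after projecting to an equal-area CRS (EPSG:5070)
st_area(st_transform(pop_texas, 5070))

# Neither figure will exactly match ALAND, which is a published land-area value
# rather than a measurement of the polygon you happen to have loaded.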

Filling road network data gaps

I have a motorway network with count points that can be matched to road links. However, they only match around half of the OSM links. The network is unidirectional, and it should be possible to assign data from adjoining links to the missing links.
I currently have a rather ugly and long solution based on a WHILE loop that sequentially fills the connecting links. However, I think a more elegant solution might be possible using an sfnetwork or a spatial lines network. The packages stplanr, sfnetworks and dodgr closely match what I want to do, but all seem to focus on routing and origin-destination data.
Below is a reproducible example that uses a small area of UK motorway network and removes a random sample of half the links and generates flow and speed data for the half remaining.
How do I fill in the missing links with data from either end of the missing links?
library(tidyverse)
library(mapview)
library(sf)
library(osmdata)

## define the area to import osm data for
x_max <- -2.31
x_min <- -2.38
y_max <- 51.51
y_min <- 51.48

## create a data frame to set up polygon generation
df <- data.frame(X = c(x_min, x_max, x_max, x_min),
                 Y = c(y_max, y_max, y_min, y_min))

## generate a polygon of the area
rd_area <- df %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326) %>%
  dplyr::summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

## get osm geometry for motorway links in the defined area
x <- opq(bbox = rd_area) %>%
  add_osm_feature(key = 'highway', value = c('motorway', 'motorway_link')) %>%
  osmdata_sf()

## extract line geometry, generate a unique segment ID and drop excess columns
rdz <- x$osm_lines %>%
  mutate(seg_id = paste0("L", sprintf("%02d", 1:n()))) %>%
  select(seg_id)

## pretend we only have traffic counts and speeds for half the links
osm_dat <- rdz[c(3, 4, 5, 7, 11, 14, 15), ]

## links without data
osm_nodat <- filter(rdz, !seg_id %in% osm_dat$seg_id)

## visualise links with and without data
mapview(osm_dat, color = "green") + mapview(osm_nodat, color = "red")

## make up some data to work with
pretend_counts <- st_centroid(osm_dat)

## assign some random annual average daily flows and average speeds
pretend_counts$aadt <- sample(200:600, nrow(pretend_counts))
pretend_counts$speed <- sample(40:80, nrow(pretend_counts))
Here is one quick and elegant solution from the Cyipt project: https://github.com/cyipt/cyipt/blob/master/scripts/prep_data/get_traffic.R
It adapts the code from the get.aadt.class function and uses Voronoi polygons to assign flows and speeds to the nearest roads. However, it doesn't distribute the flows, i.e. split them where one link meets two, and it sometimes results in opposing directions having the same flows and speeds.
library(dismo)  ## dismo package for Voronoi polygon generation

# Make Voronoi polygons around the count points and convert to sf
voronoi <- dismo::voronoi(x = st_coordinates(pretend_counts))
voronoi <- as(voronoi, "sf")
st_crs(voronoi) <- st_crs(pretend_counts)

# Find intersections of the roads without data with the Voronoi polygons
inter <- st_intersects(osm_nodat, voronoi)

# Get aadt and speed values from the count points in each polygon
osm_nodat$aadt  <- sapply(seq_len(nrow(osm_nodat)),
                          function(x) round(mean(pretend_counts$aadt[inter[[x]]]), 0))
osm_nodat$speed <- sapply(seq_len(nrow(osm_nodat)),
                          function(x) round(mean(pretend_counts$speed[inter[[x]]]), 0))

# Attach the counts back onto the links they were generated from,
# so both sets share the same columns before combining
osm_dat$aadt  <- pretend_counts$aadt
osm_dat$speed <- pretend_counts$speed

# Combine links with observed and estimated values
all_osm <- as.data.frame(rbind(osm_dat, osm_nodat))
st_geometry(all_osm) <- all_osm$geometry
flows <- dplyr::select(all_osm, aadt)
mapview(flows)
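Since the question asks for values taken from the links joining either end of each missing link, a rough alternative (a sketch, not part of the Voronoi answer above; the *_touch column names are just illustrative, and osm_dat carries the aadt and speed columns attached in the block above) is to average the values of the known links whose geometry touches each missing link:

library(sf)

# For each missing link, find the data-carrying links that touch it
touching <- st_touches(osm_nodat, osm_dat)

# Average over the touching links (NaN where no data-carrying link touches)
osm_nodat$aadt_touch  <- sapply(touching, function(i) mean(osm_dat$aadt[i]))
osm_nodat$speed_touch <- sapply(touching, function(i) mean(osm_dat$speed[i]))

Like the Voronoi approach, this does not split flows at junctions; it only borrows an average from the adjoining links.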

How to calculate the geographic and language distance between all countries and the USA in R

I am trying to measure the geographic and language distance between all countries and the United States (e.g. the distance between Canada and the USA, between Iran and the USA, etc.).
The data are available via http://www.cepii.fr/pdf_pub/wp/2011/wp2011-25.pdf but I am unsure how to actually calculate the distance from the long/lat info and the language info.
Thanks!
Edited to add photo: I want to create a 4th variable that shows the distance between each country and the USA (38.8670000, -77.0000000); an example of the data is shown in the attached photo. I know distHaversine allows me to do this one country at a time, but how can I use distHaversine to compute this for all the countries in my dataset (each country's distance to the US)?
I tried this:
mutate(geo,
       Distance = distHaversine(cbind(lon, lat),
                                c(38.8670000, -77.0000000)))
but it returns incorrect distances, because distHaversine() expects each point as (longitude, latitude), so c(38.8670000, -77.0000000) is read as longitude 38.867 and latitude -77.
Figured it out!
myLong <- -77.0000000
myLat <- 38.8670000

geo1 %>% mutate(dist = distHaversine(cbind(myLong, myLat), cbind(lon, lat)))
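distHaversine() returns metres by default, so a small follow-up (assuming the geo1 data frame and lon/lat columns above; dist_km is just an illustrative name) converts to kilometres and sorts countries by distance from the USA:

library(geosphere)
library(dplyr)

geo1 %>%
  mutate(dist_km = distHaversine(cbind(myLong, myLat), cbind(lon, lat)) / 1000) %>%
  arrange(dist_km)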

County average from latitude and longitude

I have a big data frame (832k rows) with latitude and longitude in a gridded format, plus one variable. I would like to plot the average of this variable per county. The problem is that I do not have a county or state identifier for each point, only the coordinates.
Sorry, I am not sure how to include a reproducible example.
Two approaches:
1) Average all of the lat/lon grid points. This approach skews your county centre towards higher-density grids.
2) Calculate the bounds [min-max lat/lon] of the grid points and average the bounds. This approach places the county centre exactly at the centre of the grid span.
You will need to obtain county (or state) boundary data and then spatially join it with your data frame. One possible source for such data is the TIGER/Line shapefile published by the U.S. Census Bureau (see e.g. https://catalog.data.gov/dataset/tiger-line-shapefile-2016-nation-u-s-current-county-and-equivalent-national-shapefile).
You can then use the sf package to read the shapefile into R, join it with your data, and use regular summary functions to summarize your data by county; a sketch of the join and summary step follows the download code below.
library(sf)

filename <- 'https://www2.census.gov/geo/tiger/TIGER2016/COUNTY/tl_2016_us_county.zip'
tmpfile <- tempfile()
tmpdir <- tempdir()

download.file(filename, tmpfile)
unzip(zipfile = tmpfile, exdir = tmpdir)
county_data <- st_read(paste0(tmpdir, '/tl_2016_us_county.shp'))

unlink(tmpfile)
unlink(tmpdir)
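A sketch of the join and summary step (assuming your gridded data frame is called pts with columns lon, lat and value; those names are illustrative, and the TIGER geometries use NAD83, EPSG:4269):

library(sf)
library(dplyr)

# Turn the gridded data frame into point geometries
pts_sf <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4269)

# Attach the county each point falls in, then average the variable by county
county_means <- pts_sf %>%
  st_join(county_data["GEOID"]) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(mean_value = mean(value, na.rm = TRUE))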

Filter shapefile polygons by area

I have the following boundary dataset for the United Kingdom, which shows all the counties:
library(raster)
library(sp)
library(ggplot2)

# Download the data
GB <- getData('GADM', country = "gbr", level = 2)
Using the subset function it is really easy to filter the shapefile polygons by an attribute in the data. For example, if I want to exclude Northern Ireland:
GB_sub <- subset(GB, NAME_1 != "Northern Ireland")
However, there are lots of small islands which distort the scale data range, as shown in the maps below:
Any thoughts on how to elegantly subset the dataset by a minimum size? It would be ideal to have something in a format consistent with the subset call above. For example:
GB_sub <- subset(GB, Area > 20) # specify a minimum area in km^2
Here is another potential solution. Because your data are in a lat-long (unprojected) CRS, calculating the area directly from latitude and longitude would be biased; it is better to calculate the area with functions from the geosphere package.
install.packages("geosphere")
library(geosphere)
# Calculate the area
GB$poly_area <- areaPolygon(GB) / 10^6
# Filter GB based on area > 20 km2
GB_filter <- subset(GB, poly_area > 20)
poly_area contains the area in km^2 for each polygon. We can filter the polygons by a threshold, such as 20 in your example. GB_filter is the final output.
This is one potential solution:
GB_sub = GB[sapply(GB@polygons, function(x) x@area > 0.04), ] # select a minimum size
map.df <- fortify(GB_sub)
ggplot(map.df, aes(x = long, y = lat, group = group)) + geom_polygon()
Check this link for specifics on how the area value is actually interpreted: Getting a slot's value of S4 objects?
I compared both as well but they don't seem to differ:
out1 = sapply(GB@polygons, function(x) x@area)
out2 = rgeos::gArea(GB, byid = TRUE)
