I have two problems I'm trying to solve; the first is the main one. Hopefully I've explained the second one decently.
1) My initial issue is creating a spatial polygons data frame from a tibble. For example, I have a tibble from the urbnmapr library that outlines the U.S. states, and I want to be able to plot spatial polygons for all 50 states. (Note: I have already made a map from these data in ggplot, but I specifically want spatial polygons to plot and animate in leaflet):
> states <- urbnmapr::states
> states
# A tibble: 83,933 x 10
long lat order hole piece group state_fips state_abbv state_name fips
<dbl> <dbl> <int> <lgl> <fct> <fct> <chr> <chr> <chr> <chr>
1 -88.5 31.9 1 FALSE 1 01.1 01 AL Alabama 01
2 -88.5 31.9 2 FALSE 1 01.1 01 AL Alabama 01
3 -88.5 31.9 3 FALSE 1 01.1 01 AL Alabama 01
...
2) Once I do this, I will want to join additional data from a separate tibble to the spatial polygons by state name. What would be the best way to do that if I have different data for each year? That is, for the 50 states I have three years of data, so would I create 150 different polygons (one per state per year), or keep 50 state polygons that each carry all the information needed to make three different plots of all states, one per year?
I can suggest the following (untested, because the urbnmapr package is not available for my R version).
Problem 1
If you specifically want polygons, I think the best would be to join a dataframe to an object that comes from a shapefile.
If you still want to do it on your own, you need to do two things:
Convert your tibble into a spatial object with a point geometry
Aggregate points by state
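The sf package can do both. For the first step (the easy one), use the st_as_sf function. Note that the coordinate columns in your tibble are named long and lat.
library(sf)
library(dplyr) # provides %>%, group_by() and summarise()
states
states_spat <- states %>% st_as_sf(coords = c("long", "lat"))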
For the second step, you will need to aggregate geometries. What I can suggest will give you a MULTIPOINT geometry per state, not polygons. To convert those into polygons, you could find this thread helpful
states_spat <- states_spat %>% group_by(state_name) %>%
dplyr::summarise(x = n())
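As a minimal sketch of the remaining points-to-polygons step, the ordered vertices of each group can be combined without a union and then cast to POLYGON. The toy data frame below is a hypothetical stand-in for urbnmapr::states (same column names, invented coordinates), and the sketch assumes that each group is a single closed ring with no holes; rows with hole = TRUE in the real data would need extra handling.

```r
library(sf)
library(dplyr)

# Hypothetical stand-in for urbnmapr::states: two square "states",
# each ring closed (first vertex repeated at the end)
pts <- data.frame(
  long       = c(0, 1, 1, 0, 0,  2, 3, 3, 2, 2),
  lat        = c(0, 0, 1, 1, 0,  0, 0, 1, 1, 0),
  order      = rep(1:5, times = 2),
  group      = rep(c("01.1", "02.1"), each = 5),
  state_name = rep(c("A", "B"), each = 5)
)

polys <- pts %>%
  arrange(group, order) %>%                        # keep vertices in drawing order
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  group_by(group) %>%
  summarise(state_name = first(state_name),
            do_union = FALSE) %>%                  # combine points, preserving order
  st_cast("POLYGON")                               # one ring per group
```

Grouping by group rather than state_name keeps multi-piece states (islands, for instance) as separate rings instead of chaining all their vertices into one.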
Problem 2
That's a standard join based on a common attribute shared by your data and the spatial object (e.g. a state code). merge or the *_join functions from dplyr work with sf data as they do with tibbles. You will find some pointers there
By the way, I think it is better for you to do that than creating your own polygons from a series of points.
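For the question of 150 versus 50 polygons, one option is to keep the yearly data in long format and let the join repeat each polygon once per year, then filter to a single year when plotting. This is only a sketch: state_polys and yearly are hypothetical stand-ins, and the column names year and value are assumptions.

```r
library(sf)
library(dplyr)

# Hypothetical stand-in for the state polygons: two unit squares
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}
state_polys <- st_sf(
  state_name = c("A", "B"),
  geometry   = st_sfc(square(0), square(2))
)

# Hypothetical yearly data in long format: one row per state and year
yearly <- data.frame(
  state_name = rep(c("A", "B"), each = 3),
  year       = rep(2018:2020, times = 2),
  value      = c(1, 2, 3, 4, 5, 6)
)

# The join repeats each polygon once per matching year
states_long <- left_join(state_polys, yearly, by = "state_name")

# One layer per year, e.g. for a single map or animation frame
states_2019 <- filter(states_long, year == 2019)
```

The alternative is a wide table with one column per year, which keeps 50 rows but needs a different column name for each plot.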
Overview
Using R, I would like to count the number of points within a polygon according to a specific criterion (a temporal window).
I have the following data:
Geo-located survey data that include the date of the survey interview. Thus, I am able to pinpoint exactly when and where each survey was conducted and to map them out across the United States.
Geo-located data about political rallies across the United States. These also include the date.
Using QGIS, I created a set of circular 50 mile buffers around each survey respondent. My goal is to count the number of political rallies that falls within each respondent's "buffer" within a specific time frame preceding the interview. The 50 mile buffers created in QGIS retain all variables of the original data, including the date of the interview.
Data
Using QGIS, I created some mock shapefiles with dates and locations to aid in replication.
Approach
I am trying to use GISTools::poly.counts to count the number of rallies within different temporal windows (30 days, 90 days, etc.).
Generally, to count the number of points within a polygon, I would simply use:
count <- GISTools::poly.counts(rallies, buffer)
This gives me the total number of rallies that occur within each buffer, but doesn't allow me to specify temporal windows. For example, it would be great to develop a count of the number of rallies within a buffer for the 30 days preceding the survey interview as well as the 90 days preceding the interview.
Remember, each polygon within my buffer shapefile has a different date of interview.
Here's what I've tried, but it's not working:
buffer$count_30 <- GISTools::poly.counts(
rallies[buffer$date - rallies$date > 0 & buffer$date - rallies$date <= 30],
buffer)
I get the following error:
Error in `[.data.frame`(x@data, i, j, ..., drop = FALSE) :
undefined columns selected
In addition: Warning messages:
1: In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
2: In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
What is the correct way to accomplish this?
I approached your problem in a different way by using the sf package instead of GISTools. The algorithm is straightforward and you can easily adapt it to your GISTools::poly.counts() method:
Read in shapefiles (st_read())
Filter shapefiles by date using dplyr (make sure you've got Date objects to create the windows)
Find the intersection of whichever points data with the rally buffer (st_intersection())
Get the size of the intersection object (nrow())
Likely you'll have to tweak the function parameters to make sure it works correctly for the real data. Below is an example using your mock data.
Setup and read in the data (note stringsAsFactors=F just makes the dates easier to create; not necessary for R version 4.x).
require(tidyverse)
require(magrittr) #adds the %<>% operator
require(sf)
require(lubridate)
rally <- st_read(dsn=getwd(),layer='rallies',stringsAsFactors = F)
buff <- st_read(dsn=getwd(),layer='50m_buffer',stringsAsFactors = F)
surv <- st_read(dsn=getwd(),layer='surveys',stringsAsFactors = F)
Create the date objects.
rally %<>% mutate(date=ymd(date))
buff %<>% mutate(date=ymd(date))
surv %<>% mutate(date=ymd(date))
window <- c(ymd('2020-03-27')-30, ymd('2020-03-27')+30)
Filter the data using the temporal window.
buffSub <- buff %>%
filter(date>=window[1] & date<=window[2])
rallySub <- rally %>%
filter(date>=window[1] & date<=window[2])
Get the number of intersecting points.
intersectObject <- st_intersection(rallySub, buffSub)
nrow(intersectObject)
Or if you want to use the days since a rally or something along those lines you can create new columns in whichever points object which represent the time difference between a rally and an active buffer and use those values to filter.
Loop through the dates for each rally and get the time difference with each buffer.
daysDiff <- data.frame(t(sapply(rally$date, function(d) d-buff$date)))
Add those columns to the data and rename with buff1, buff2, etc.
rallyNew <- bind_cols(rally, daysDiff) %>%
rename_with(~gsub('X', 'buff', .x))
Use those values to filter. Go one column at a time, filter, and get the intersection with buffer associated with that column.
WINDOW=20
for(i in 4:ncol(rallyNew)){
rallySub <- rallyNew %>%
filter(get(unlist(names(rallyNew))[i])<WINDOW &
get(unlist(names(rallyNew))[i])>-WINDOW)
intersectObject <- st_intersection(rallySub, buff[i-3,]) # column i holds the day differences for row i-3 of buff
print(nrow(intersectObject))
}
Another answer using sf, but this time using spatial joins and dplyr for filtering etc.
library(tidyverse)
library(sf)
rallies <- read_sf('Downloads/stack_ex_q/rallies.shp')
# Here I don't use the supplied buffer, but make one according to the data
#fifty_buff <- read_sf('Downloads/stack_ex_q/rallies.shp')
surveys <- read_sf('Downloads/stack_ex_q/surveys.shp')
# Transform to a crs using meters as a distance & make date col a proper date
rallies <- st_transform(rallies, crs = 2163) %>%
mutate(date = as.Date(date))
surveys <- st_transform(surveys, crs = 2163) %>%
mutate(date = as.Date(date))
# make a buffer w/ 50 mile radius (80467 meters), not used but useful for visualization
buffer_50mi <- st_buffer(surveys, dist = 80467)
Plot the data for a quick visual check:
library(mapview)
mapview(rallies, col.regions = 'purple') +
mapview(surveys, col.regions = 'black') +
mapview(buffer_50mi, col.regions = 'green')
Join the data using st_is_within_distance, using 80467m = 50 miles.
joined <- surveys %>%
st_join(rallies, join = st_is_within_distance, dist = 80467)
head(joined)
Simple feature collection with 6 features and 4 fields
geometry type: POINT
dimension: XY
bbox: xmin: 1350401 ymin: -556609 xmax: 1438586 ymax: -455743.1
projected CRS: NAD27 / US National Atlas Equal Area
# A tibble: 6 x 5
id.x date.x geometry id.y date.y
<dbl> <date> <POINT [m]> <dbl> <date>
1 1 2020-04-26 (1350401 -556609) 16 2020-02-19
2 1 2020-04-26 (1350401 -556609) 17 2020-05-12
3 2 2020-03-27 (1438586 -455743.1) 7 2020-02-18
4 2 2020-03-27 (1438586 -455743.1) 15 2020-07-01
5 2 2020-03-27 (1438586 -455743.1) 15 2020-03-28
6 3 2020-06-12 (1352585 -479940.5) 15 2020-07-01
The .x columns are from the survey sf object & the .y columns are from the rallies sf object. Geometry is retained from the survey sf.
Using dplyr's filter, group_by, and mutate, find what you're looking for. The code below counts rallies within 50 miles and +/- 60 days by survey point as an example.
joined_60days <- joined %>%
group_by(id.x) %>%
mutate(date_diff = as.numeric(date.x - date.y)) %>%
filter(!is.na(date_diff)) %>% ## remove survey points with no rallies in 50mi/60d
filter(abs(date_diff) <= 60) %>%
group_by(id.x) %>%
count()
head(joined_60days)
Simple feature collection with 4 features and 2 fields
geometry type: POINT
dimension: XY
bbox: xmin: 1268816 ymin: -556609 xmax: 1438586 ymax: -322572.4
projected CRS: NAD27 / US National Atlas Equal Area
# A tibble: 4 x 3
id.x n geometry
<dbl> <int> <POINT [m]>
1 1 1 (1350401 -556609)
2 2 2 (1438586 -455743.1)
3 3 1 (1352585 -479940.5)
4 4 2 (1268816 -322572.4)
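The question asked for rallies in the 30 and 90 days preceding each interview rather than a symmetric window. A minimal base R sketch of that filter follows, using a plain data frame that copies the six rows printed by head(joined) above (geometry dropped); the helper count_preceding is a hypothetical name introduced here.

```r
# Rows copied from the head(joined) output above, without the geometry column
joined_df <- data.frame(
  id.x   = c(1, 1, 2, 2, 2, 3),
  date.x = as.Date(c("2020-04-26", "2020-04-26", "2020-03-27",
                     "2020-03-27", "2020-03-27", "2020-06-12")),
  id.y   = c(16, 17, 7, 15, 15, 15),
  date.y = as.Date(c("2020-02-19", "2020-05-12", "2020-02-18",
                     "2020-07-01", "2020-03-28", "2020-07-01"))
)

# Positive values mean the rally happened before the interview
days_before <- as.numeric(joined_df$date.x - joined_df$date.y)

# Hypothetical helper: rallies per survey in the `window` days before the interview
count_preceding <- function(window) {
  in_window <- days_before > 0 & days_before <= window
  as.vector(tapply(in_window, joined_df$id.x, sum))
}

counts <- data.frame(
  id.x     = sort(unique(joined_df$id.x)),
  count_30 = count_preceding(30),
  count_90 = count_preceding(90)
)
```

Unlike the filter-then-count pipeline, this keeps surveys with zero rallies in the window, because the logical vector is summed per survey instead of dropping rows.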
I have a shapefile of the Philippines that has the correct labels for each province. After removing some of the provinces I won't be using, aggregating the data into a single data frame, and then attaching my covariates to the shapefile, I run into trouble. When I use tmap to create some maps, the provinces are mislabeled and, therefore, the wrong data is applied to the provinces. I am doing a spatial-temporal analysis with this data, so it's important that the provinces are in the correct locations.
I have tried reprojecting the shapefile, but it doesn't seem to work.
#reading in shapefile
shp <- readOGR(".","province.csv")
#removing provinces not in data from shapefile
myshp82=shp
shp@data$prov=as.character(shp@data$prov)
ind=shp@data$prov%in% mydata$prov
shp.subset=shp[ind,]
#attaching covariates to shapefile for plotting, myagg is my data frame.
#The shape files are divided in four different time periods.
myagg_time1=myagg[myagg$period==1,]
myagg_time2=myagg[myagg$period==2,]
myagg_time3=myagg[myagg$period==3,]
myagg_time4=myagg[myagg$period==4,]
myshptime1=myshptime2=myshptime3=myshptime4=shp
myshptime1@data=merge(myshptime1@data, myagg_time1, by='prov',all.x=TRUE)
myshptime2@data=merge(myshptime2@data, myagg_time2, by='prov',all.x=TRUE)
myshptime3@data=merge(myshptime3@data, myagg_time3, by='prov',all.x=TRUE)
myshptime4@data=merge(myshptime4@data, myagg_time4, by='prov',all.x=TRUE)
#desc maps. Here's the code I've been using for one of the maps.
Per1= tm_shape(myshptime1)+
tm_polygons(c('total_incomeMed','IRA_depMean','pov'), title=c('Total Income', 'IRA', 'Poverty (%)'))+
tm_facets(sync = TRUE, ncol=3)
#sample data from my data sheet "myagg". First column is provinces.
period counts total_income_MED IRA_depMean
Agusan del Norte.1 1 2 119.33052 0.8939136
Agusan del Norte.2 2 0 280.96928 0.8939136
Agusan del Norte.3 3 1 368.30082 0.8939136
Agusan del Norte.4 4 0 368.30082 0.8950379
Aklan.5 1 0 129.63132 0.8716863
Aklan.6 2 3 282.95535 0.8716863
Aklan.7 3 3 460.29969 0.8716863
Aklan.8 4 0 460.29969 0.8437920
Albay.9 1 0 280.12221 0.8696165
Albay.10 2 3 453.05098 0.8696165
Albay.11 3 1 720.40732 0.8696165
Albay.12 4 0 720.40732 0.8254676
Essentially the above tmap code creates three maps for this time period side-by-side for each of the different covariates ('total_incomeMed','IRA_depMean','pov'). This is happening, but the provinces are mislabeled and the data is tied to the name of the province. I just need the provinces properly labeled!
Sorry if this doesn't make sense. Happy to clarify more if needed.
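One possible cause, offered as a sketch rather than a diagnosis of the actual files: base R's merge() sorts its result by the key column by default, so assigning the merged table back to the data slot can leave the attribute rows in a different order from the polygons. The two small data frames below are hypothetical stand-ins for the shapefile's attribute table and the covariates.

```r
# Hypothetical attribute table, in polygon order (not alphabetical)
shp_data <- data.frame(prov = c("Cebu", "Albay", "Bohol"),
                       poly_id = 1:3,
                       stringsAsFactors = FALSE)

# Hypothetical covariates keyed by province
covars <- data.frame(prov = c("Albay", "Bohol", "Cebu"),
                     pov = c(10, 20, 30),
                     stringsAsFactors = FALSE)

# merge() returns rows sorted by `prov`, not in polygon order
merged <- merge(shp_data, covars, by = "prov", all.x = TRUE)

# Restore the original polygon order before assigning back to the data slot
aligned <- merged[match(shp_data$prov, merged$prov), ]
```

Here merged$prov comes back as Albay, Bohol, Cebu, while the polygons are stored as Cebu, Albay, Bohol; the match() step puts the rows back in polygon order.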
I have been working on leaflet in R.
https://rstudio.github.io/leaflet/choropleths.html
The US map above shows the density of each state. The data format is GeoJSON. I want to remove the density variable and pass my own column name with its corresponding value. (For example, when I hover over New Mexico I get the density as 17.16 (density:17.16); instead I want to display (mycolumnname:value).)
This is a pretty common need in working with leaflet. There are a few ways to do this, but this is the simplest in my mind:
All of the information you would like to plot is stored in the data slot of the SpatialPolygonsDataFrame, found at states@data, which you can see by looking at the head of this data frame:
I made a data frame (a traditional R data frame) using the state names from the original SpatialPolygonsDataFrame, named states in your code above, and created my_var.
a<-data.frame( States=states@data$name)
a$my_var <- round(runif(52, 15, 185),2)
This is the first few rows of my new data frame, which is like yours but has data OTHER than density in it.
head(a)
States my_var
1 Alabama 120.33
2 Alaska 179.41
3 Arizona 67.92
4 Arkansas 30.57
5 California 72.26
6 Colorado 56.33
Now that you have this data frame you can call up the library maptools and do a polygon cbind as follows:
states2<-spCbind(states,a$my_var)
Now looking at the head of states2 (which you could name states and replace the original states SpatialPolygonsDataFrame I kept both to compare before and after)
head(states2@data)
id name density data.my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
2 04 Arizona 57.050 81.05
3 05 Arkansas 56.430 124.68
4 06 California 241.700 138.19
5 08 Colorado 49.330 103.78
This added the data.my_var variable to the spatial data frame. Now you can use find/replace to go through your code and replace the references to density with data.my_var, and the new variable will be used.
Important things to consider
Your data has 50 state names, while the spatial data frame has 52 rows, so you will need to add the missing regions to your data frame before cbinding: the two must be the same length AND in the same order.
If you grab the names like this:
a<-data.frame( States=states@data$name)
from the states object, you can then left merge your data onto States. Regions that have no match in your data set are kept, with NA in the new columns.
Use merge so that values are matched by name, then restore the polygon order, because merge sorts its result by the key column.
a <- merge(a, your_data, by.x = "States", by.y = "name", all.x = TRUE) # assumes your_data stores state names in a column called name
a <- a[match(states@data$name, a$States), ] # back to the row order of states@data
Also, once they are merged and you have checked that states@data$name is in the same order as a$States, you can use any name you want as the new heading in the SpatialPolygonsDataFrame by extracting the data into a vector with the name you want prior to binding them:
my_var <- a$my_var
states2<-spCbind(states, my_var)
this will leave you with a data frame which looks like this:
id name density my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
This is easier to address as a column name from inside leaflet without long strings.
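To make the alignment and the hover text concrete, here is a minimal base R sketch. The attrs data frame is a hypothetical stand-in for the attribute table of the states object, my_data stands in for your own data, and the label format is an assumption modelled on the sprintf pattern used in the leaflet choropleth tutorial.

```r
# Hypothetical stand-in for the attribute table of the states object
attrs <- data.frame(name = c("Alabama", "Alaska", "Arizona"),
                    density = c(94.65, 1.264, 57.05),
                    stringsAsFactors = FALSE)

# Hypothetical user data: different order, one state missing
my_data <- data.frame(States = c("Arizona", "Alabama"),
                      my_var = c(67.92, 120.33),
                      stringsAsFactors = FALSE)

# match() looks values up by name, so row order and missing states are handled
attrs$my_var <- my_data$my_var[match(attrs$name, my_data$States)]

# Hover text of the form "my_var: value" instead of "density: value"
labels <- sprintf("<strong>%s</strong><br/>my_var: %s",
                  attrs$name,
                  ifelse(is.na(attrs$my_var), "no data",
                         as.character(attrs$my_var)))
```

The resulting character vector can be converted with htmltools::HTML and passed as the label argument of addPolygons.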
I am looking for a way to shade counties on a US map in R. I have a list of numeric/character county FIPS codes that I can input as a parameter. I just need to highlight these counties, so I would just need to shade them; there are no values or variations corresponding to the counties. I tried to look up
library(choroplethr)
library(maps)
and
county_choropleth(df_pop_county)
head(df_pop_county)
region value
1 1001 54590
2 1003 183226
3 1005 27469
4 1007 22769
5 1009 57466
6 1011 10779
But these need a region, value pair, e.g. the FIPS code and population above. Is there a way to call the county_choropleth function without values, using just the data frame of FIPS codes? That way, I can shade my FIPS codes with one color. What would be an efficient way to accomplish this in R using choroplethr?
Here's an example using the maps library:
library(maps)
library(dplyr)
data(county.fips)
## Set up fake df_pop_county data frame
df_pop_county <- data.frame(region=county.fips$fips)
df_pop_county$value <- county.fips$fips
y <- df_pop_county$value
df_pop_county$color <- gray(y / max(y))
## merge population data with county.fips to make sure color column is
## ordered correctly.
counties <- county.fips %>% left_join(df_pop_county, by=c('fips'='region'))
map("county", fill=TRUE, col=counties$color)
Here's the resulting map:
Notice that counties with lower FIPS are darker, while counties with higher FIPS are lighter.
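For the original goal of shading a given set of counties in a single color, a variation on the same idea is to build a two-color vector from a membership test. This is a sketch using the maps package rather than choroplethr; the highlight vector holds hypothetical FIPS codes, and polygons are matched to FIPS codes by name so the colors follow the order in which map() draws them.

```r
library(maps)
data(county.fips)

# Hypothetical FIPS codes to highlight
highlight <- c(1001, 1003)

# Polygon names in the order map() draws them
county_names <- map("county", plot = FALSE, namesonly = TRUE)

# Look up the FIPS code of each polygon by its name
fips_by_polygon <- county.fips$fips[match(county_names, county.fips$polyname)]

# One fill color for highlighted counties, another for the rest
cols <- ifelse(fips_by_polygon %in% highlight, "red", "white")

map("county", fill = TRUE, col = cols)
```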
Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.
The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a library simply for it.
Your state, region column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason to build in a function to do it automatically for this specific scenario. You could do it without creating the column, though:
tapply(outcome,gsub(",.*$","",testset$regions),sum)
The gsub call just removes the comma and everything after it, leaving the state name as the index.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
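If a data frame result is preferred over the named vector that tapply returns, a small variation is to create the state column first and then use aggregate() with a formula. This is a sketch; the outcome values are typed in from the printed example rather than regenerated with rpois.

```r
regions <- c("Michigan, Eastern", "Michigan, Western", "Minnesota",
             "Mississippi, Northern", "Mississippi, Southern",
             "Missouri, Eastern", "Missouri, Western")
outcome <- c(10, 11, 17, 12, 12, 17, 13)  # values from the printed example
testset <- data.frame(regions, outcome)

# Strip the comma and everything after it to get the state name
testset$state <- sub(",.*$", "", testset$regions)

# Sum the outcome by state; swap sum for mean or max as needed
by_state <- aggregate(outcome ~ state, data = testset, FUN = sum)
```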