Intersecting big spatial datasets in R sf

I have two spatial datasets. One dataset contains many polygons (more than 150k in total) describing different features such as rivers and vegetation. The other dataset contains far fewer polygons (500) describing different areas.
I need to intersect those two datasets to get the features in the different areas.
I can subset the first dataset by the different features. If I use a subset from a small feature (2,500 polygons), the intersection with the areas is quite fast (5 min). But if I want to intersect a bigger feature subset (20,000 polygons), the computation runs very long (I terminated it after two hours). And this is not even the biggest feature (50,000 polygons) I need to intersect.
This is the code snippet I run:
clean_intersect_save = function(geo_features, areas) {
  # make geometries valid
  data_valid_geoms = st_parallel(sf_df = st_geometry(geo_features),
                                 sf_func = st_make_valid,
                                 n_cores = 4)
  # remove unnecessary columns and re-attach the valid geometries
  data_valid = st_drop_geometry(geo_features) %>% select("feature")
  data_valid = st_sf(data_valid, geometry = data_valid_geoms)
  # intersect the geo-features and the areas
  data_valid_split = st_parallel(sf_df = areas,
                                 sf_func = st_intersection,
                                 n_cores = 4,
                                 data_valid)
  # save shp file
  st_write(data_valid_split, "data_valid_split.shp")
  return(data_valid_split)
}
Where both inputs are sf data frames.
st_parallel is a function I found here.
My question is: How would experienced spatial data people solve such a task usually? Do I just need more cores and/or more patience? Am I using sf wrong? Is R/sf the wrong tool?
Thanks for any help.
This is my very first spatial data analysis project, so sorry if I overlook something obvious.

As a real answer to this vague question probably won't come, I will answer it myself.
Thanks @Chris and @TimSalabim for the help. I ended up with a combination of both ideas.
I ended up using PostGIS, which in my experience is a pretty intuitive way to work with spatial data.
The three things that sped up the intersection calculations for me were:
1. In my case the spatial data was stored as MULTIPOLYGONs when loaded from the shapefile. I expanded those into POLYGONs using ST_Dump: https://postgis.net/docs/ST_Dump.html
2. I created a spatial index on the POLYGONs: https://postgis.net/workshops/postgis-intro/indexing.html
3. I used a combination of ST_Intersection and ST_Intersects so that the costly ST_Intersection is only called when really needed (as @TimSalabim suggested, this approach could also speed things up in R, as sketched below, but I currently have no time to test it): https://postgis.net/2014/03/14/tip_intersection_faster/
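For the R route, a minimal sketch of the st_intersects-then-st_intersection idea (untested here; the object names features and areas are placeholders):

library(sf)

# hypothetical inputs: 'features' (many polygons) and 'areas' (few polygons),
# both sf data frames in the same CRS
features <- st_make_valid(features)
areas    <- st_make_valid(areas)

# cheap predicate first: which feature polygons touch any area at all?
hits <- st_intersects(features, areas)

# run the expensive st_intersection only on features that have a match
candidates  <- features[lengths(hits) > 0, ]
intersected <- st_intersection(candidates, areas)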

Related

Intersection and difference of PostGIS data using R

I am an absolute beginner in PostgreSQL and PostGIS (databases in general) but have a fairly good working experience in R. I have two multi-polygon data sets of vulnerable areas of India from two different sources - one is around 12gb and it's in .gdb format (let's call it mygdb) and the other is a shapefile around 2gb (let's call it myshp). I want to compare the two sets of vulnerability maps and generate some state-wise measures of fit using intersection (I), difference (D), and union (U) between the maps.
I would like to make use of PostGIS functionality (via R), as neither R (crashes!) nor QGIS (too slow) is efficient for this. To start with, I have uploaded both data sets into my PostGIS database. I used ogr2ogr in R to upload mygdb, but I am kind of stuck at this point. My idea is to split both polygon files by state and then apply other functions to get I, U and D. From my search, I think I can use sf functions like st_split, st_intersection, st_difference, and st_union. However, even after splitting, I imagine the file sizes will still be too large for R to process, so my questions are:
Is my approach the best way forward?
How can I use sf::st_ functions (e.g. st_split, st_intersection) without importing the data from the database into R?
There are some useful answers to previous relevant questions, like this one for example. But I find it hard to put the steps together from different links and any help with a dummy example would be great. Many thanks in advance.
Maybe you could try loading it as a stars proxy. It doesn't load the file into memory; operations are applied lazily, reading from the hard drive only when the result is needed.
https://r-spatial.github.io/stars/articles/stars2.html
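A minimal sketch of what that looks like (the filename is a placeholder; note the proxy mechanism is mainly geared towards raster data):

library(stars)

# open lazily as a stars_proxy object ("large_raster.tif" is a placeholder):
# nothing is read into memory at this point
r <- read_stars("large_raster.tif", proxy = TRUE)

# operations on a proxy are recorded and only evaluated when needed,
# e.g. when plotting (at reduced resolution) or writing to disk
plot(r)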
Not an answer to the question sensu stricto; however, in response to a request in the comments, here is an example of a PostgreSQL/PostGIS query using ST_Intersection, based on OSM data imported into a PostgreSQL database with osm2pgsql:
WITH
highway AS (
select osm_id, way from planet_osm_line where osm_id = 332054927),
dln AS (
select osm_id, way from planet_osm_polygon where "boundary" = 'administrative'
and "admin_level" = '4' and "ref" = 'DS')
SELECT ST_Intersection(dln.way, highway.way) FROM highway, dln
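From R, a query like this can be pushed to the database so that only the result comes back into the session; a minimal sketch (untested here, connection details are placeholders):

library(DBI)
library(RPostgres)
library(sf)

# connection details are placeholders
con <- dbConnect(RPostgres::Postgres(), dbname = "osm",
                 host = "localhost", user = "user", password = "password")

# the intersection is computed inside PostGIS; only the result reaches R
res <- st_read(con, query = "
  WITH
  highway AS (
    SELECT osm_id, way FROM planet_osm_line WHERE osm_id = 332054927),
  dln AS (
    SELECT osm_id, way FROM planet_osm_polygon
    WHERE boundary = 'administrative' AND admin_level = '4' AND ref = 'DS')
  SELECT ST_Intersection(dln.way, highway.way) AS geom FROM highway, dln
")

dbDisconnect(con)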

R: How do I efficiently reference / calculate from a separate data frame using dplyr mutate?

Hello stackoverflow community,
Given a data frame with two columns of latitude and longitude points (dfLatLon), I'd like to add a third column containing the number of crimes within a ~0.1 mile radius, looked up from a separate data frame listing crimes in the city of Denver (dfCrimes).
The solution below works (which I found searching stackoverflow, thank you!), but I have a hunch it’s inefficient, so I’d like to improve it if possible.
I’ve attempted to make this code reproducible (I’ve read enough posts to know that’s a must), but you will have to go to one of the sites below to download the crime data, which is ~74MB. Hope this isn’t an issue. Alternatively, you can use the other method below (uncomment it, comment the other), which doesn’t require you to download the file separately and is thus perhaps more reproducible, but I found it to be much slower (336 sec vs. 16 sec to load CSV).
Crime Web Page: http://data.denvergov.org/dataset/city-and-county-of-denver-crime
Direct Link to the Data: http://data.denvergov.org/download/gis/crime/csv/crime.csv
# Initialization Stuff
library(dplyr)
library(doParallel)
registerDoParallel(cores=4)
set.seed(77) #In honor of Red Grange … Go Illini!
#Set Working Directory
#setwd("INSERT YOUR WD HERE, LOCATION OF CRIME DATA")
#Load Crime Data
dfCrimes <- read.csv("crime.csv")
#Alternate method to obtain file, no separate download or setting working directory required, but MUCH slower.
#dfCrimes <- read.csv("http://data.denvergov.org/download/gis/crime/csv/crime.csv")
#Set Degrees per Mile Constants
cstDegPerMileLat <- 0.01450737744
cstDegPerMileLon <- 0.01882547865
#Create Lats & Lons Data Frame (a grid of ~0.1mi centers in a square around Denver)
vecLat <- seq(39.6098, 39.9146, cstDegPerMileLat * 0.1)
vecLon <- seq(-105.1100, -104.5987, cstDegPerMileLon * 0.1)
dfLatLon <- expand.grid(vecLat, vecLon)
colnames(dfLatLon) <- c("lat","lon")
#Add 3rd Column
#THIS IS THE PART THAT I THINK CAN BE MORE EFFICIENT … PLEASE HELP!
system.time(
  dfLatLon <- dfLatLon %>%
    rowwise %>%
    mutate(newcol = sum(dfCrimes$IS_CRIME[
      ((dfCrimes$GEO_LAT - lat) * (1 / cstDegPerMileLat))^2 +
      ((dfCrimes$GEO_LON - lon) * (1 / cstDegPerMileLon))^2 < 0.1^2
    ]))
)
#Wrapped formula above in system.time to measure efficiency.
#At its core, the formula above is just the basic formula for a circle, x^2 + y^2 = r^2, with adjustments to convert degrees of lat/lon to miles.
This took 747 seconds (~12 minutes) to run and used only 1 processor, which is way better than the ~30 min it took to run in Excel using all 4 processors. I realize 12 minutes isn’t prohibitively long, but if I scale this solution to larger problems, it’ll matter more.
Also, I'm running RStudio on Windows 10.
Here are my specific questions:
Are there ways of using dplyr to run this more efficiently? I’ve read it’s super-efficient for this kind of problem. I suspect rowwise is a performance killer (perhaps it’s not vectorized; could it be?), but I’ve not been able to get it to run without using rowwise.
1a. Should I convert from data frames to data.tables?
1b. How can I use multiple processors to increase speed? Seems a waste to let 3 processor cores sit idle.
1c. Should I be using some kind of join or group_by? I don’t think so because there’s not a direct reference to exact lat’s and lon’s, but I’m open to this possibility if that’s the right (faster) answer.
If there’s a better way to tackle this problem without dplyr, I’d like to see it, but I’d also like to see a solution in dplyr just so I can learn it better.
Finally, please note I’m doing this analysis for the sake of learning R and data science (I don’t work for the City of Denver), and am exploring larger data sets to improve my skills. I’m a beginner to R and data science, but I’ve been an analyst doing some fairly sophisticated analyses in Excel for years. I know Excel’s limitations, and am thus fascinated by the potential of R, data science, and machine learning.
This is my first post, so hopefully I’ve covered everything. Please let me know in the comments if I’m missing something, posted in the wrong spot, violated some rule, etc.
Thanks so much for your help!!!
p.s., I realize my lats and lons aren’t evenly spaced, and the degrees of lat/lon per mile aren’t uniform either, both due to convergence toward the north pole and Earth’s not-quite-spherical shape. I’m ignoring this for now. I just want to know how to efficiently reference an ‘outside’ data frame using dplyr.
p.p.s., Eventually I plan to make a predictive model from this data, likely experimenting using caret, but for now I’m trying to improve my pre-processing skills.
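For what it's worth, here is a hedged sketch of one alternative to the rowwise() approach, treating both tables as sf point layers (column names follow the code above; st_is_within_distance works on spherical distances in metres, so the counts may differ slightly from the flat lat/lon approximation):

library(sf)

# hypothetical sf versions of the two data frames from the code above;
# rows with missing coordinates are dropped first
crimes_sf <- st_as_sf(subset(dfCrimes, IS_CRIME == 1 & !is.na(GEO_LON) & !is.na(GEO_LAT)),
                      coords = c("GEO_LON", "GEO_LAT"), crs = 4326)
grid_sf   <- st_as_sf(dfLatLon, coords = c("lon", "lat"), crs = 4326)

# count crime points within ~0.1 mile (about 161 m) of each grid point
near <- st_is_within_distance(grid_sf, crimes_sf, dist = 161)
dfLatLon$newcol <- lengths(near)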

Extracting point data from a large shape file in R

I'm having trouble extracting point data from a large shape file (916.2 Mb, 4618197 elements - from here: https://earthdata.nasa.gov/data/near-real-time-data/firms/active-fire-data) in R. I'm using readShapeSpatial in maptools to read in the shape file which takes a while but eventually works:
worldmap <- readShapeSpatial("shp_file_name")
I then have a data.frame of coordinates that I want to extract data for. However, R is really struggling with this and either loses connection or freezes, even with just one set of coordinates!
pt <-data.frame(lat=-64,long=-13.5)
pt<-SpatialPoints(pt)
e<-over(pt,worldmap)
Could anyone advise me on a more efficient way of doing this?
Or is it the case that I need to run this script on something more powerful (currently using a mac mini with 2.3 GHz processor)?
Many thanks!
By 'point data' do you mean the longitude and latitude coordinates? If that's the case, you can obtain the data underlying the shapefile with:
worldmap@data
You can view this in the same way you would any other data frame, for example:
View(worldmap@data)
You can also access columns in this data frame in the same way you normally would, except you don't need the @data, e.g.:
worldmap$LATITUDE
Finally, it is recommended to use readOGR from the rgdal package rather than maptools::readShapeSpatial as the former reads in the CRS/projection information.
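A minimal sketch of that suggestion (path and layer name are placeholders; it assumes the shapefile is in geographic longitude/latitude coordinates):

library(sp)
library(rgdal)

# readOGR also reads the CRS, unlike readShapeSpatial
worldmap <- readOGR(dsn = ".", layer = "shp_file_name")

# SpatialPoints expects coordinates in (x, y) = (longitude, latitude) order,
# and the point needs the same CRS as the polygons for over() to work
pt <- SpatialPoints(cbind(long = -13.5, lat = -64),
                    proj4string = CRS(proj4string(worldmap)))

e <- over(pt, worldmap)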

Dissolved polygons using R not plotting correctly

I'm trying to perform a dissolve in R. I've previously done this in QGIS but I want to achieve this in R to integrate with the rest of my workflow if possible.
I have an ESRI shapefile with small geographical polygons (output areas, if you're familiar with UK census geography). I also have a lookup table provided to me with a list of all OA codes with their associated aggregated geography code.
I can't provide the actual files I'm working on, but here are comparable files and a minimal reproducible example:
https://www.dropbox.com/s/4puoof8u5btigxq/oa-soa.csv?dl=1 (130kb csv)
https://www.dropbox.com/s/xqbi7ub2122q14r/soa.zip?dl=1 (~4MB shp)
And code:
require("rgdal") # for readOGR
require("rgeos") # for gUnion
require("maptools")
unzip("soa.zip")
soa <- readOGR(dsn = "soa", "england_oac_2011")
proj4string(soa) <- CRS("+init=epsg:27700") # British National Grid
lookup <- read.csv("oa-soa.csv")
slsoa <- gUnaryUnion(soa, id = lookup$LSOA11CD)
I've also tried:
slsoa <- unionSpatialPolygons(soa, lookup$LSOA11CD)
but my understanding is that since I have (R)GEOS installed this uses the gUnion methods from the rgeos package anyway.
So, my problem is that the dissolve appears to work; I don't get an error message and the length() function suggests I now have fewer polygons:
length(soa@polygons) # 1,817
length(slsoa@polygons) # should be 338
but the plots appear to be the same (i.e. the internal dissolves haven't worked), as demonstrated by the following two plots:
plot(soa)
plot(slsoa)
I've looked around on the internet and stackoverflow to see if I can solve my issue and found several articles but without success.
problems when unioning and dissolving polygons in R (I don't think the quality of the shapefile is the problem because I'm using a lookup table to match geographies).
https://www.nceas.ucsb.edu/scicomp/usecases/PolygonDissolveOperationsR (uses two sp objects, not lookup table).
https://gis.stackexchange.com/questions/93441/problem-with-merging-and-dissolving-a-shapefile-in-r (as far as I can tell I've followed the relevant steps)
Does anyone have any idea what I'm doing wrong and why the plots aren't working correctly?
Thanks muchly.
First, your soa shapefile has 1817 elements, each with a unique code (corresponding to lookup$OA11CD). But your lookup file has only 1667 rows. Obviously, lookup does not have "a list of all OA codes".
Second, unless lookup has the same codes as your shapefile in the same order, using gUnaryUnion(...) this way will yield garbage. You need to merge soa@data with lookup on the corresponding fields first.
Third, gUnaryUnion(...) cannot remove internal boundaries if the polygons are not contiguous (obviously).
This seems to work:
soa <- merge(soa,lookup,by.x="code",by.y="OA11CD",all.x=TRUE)
slsoa <- gUnaryUnion(soa,id=soa$LSOA11CD)
length(slsoa)
# [1] 338
par(mfrow=c(1,2),mar=c(0,0,0,0))
plot(soa);plot(slsoa)
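For comparison, a hedged sketch of the same dissolve using the newer sf package (not part of the original answer; file paths and column names as above):

library(sf)
library(dplyr)

soa_sf <- st_read("soa", "england_oac_2011") %>%
  st_set_crs(27700) %>%        # British National Grid
  left_join(lookup, by = c("code" = "OA11CD"))

# dissolve: union all output-area polygons sharing an LSOA code
slsoa_sf <- soa_sf %>%
  group_by(LSOA11CD) %>%
  summarise()

plot(st_geometry(slsoa_sf))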

Adding extra data column to shapefile using convert.to.shapefile in R's shapefiles package

My goal is very simple, namely to add 1 column of statistical data to a shapefile so that I can use it for example to colour a geographical area. The data are a country file from gadm. To this end I usually use the foreign package in R thus:
library(foreign)
newdbf <- read.dbf("CHN_adm1.dbf") #original shape file
incrdata <- read.csv("CHN_test.csv") #.csv file with same region names column + new data column
mergedbf <- merge(newdbf,incrdata)
write.dbf(mergedbf,"CHN_New")
This achieves what I want in almost all circumstances, but one of the pieces of software I am dealing with external to R will only recognize .shp files and will not read .dbf (although clearly in a sense that statement is a slight contradiction). Not sure why it won't. Anyhow, essentially it leaves me needing to do the same thing as above, but with a shapefile. I think that according to notes on shapefiles package, the process should run something like this:
library(shapefiles)
shaper <- read.shp("CHN_adm1.shp")
simplified <- convert.to.simple(shaper)
simplified <- change.id(simplified,incrdata$DataNew) #DataNew being new column of data from the .csv
simpleAsList <- by(simplified,simplified[,1],function(x)x)
####This is where I hit problems####
backToShape <- convert.to.shapefile(simplified,
data.frame(index=c("20","30","40","50","60","70","80")),"index",5)
write.shapefile(backToShape,"CHN_TestShape")
I'm afraid that I can't get my head around shapefiles, since I can't unpick them or visualize them in a way I can with dataframes, and so the resultant shape has been screwed up when it goes back to the external charting package.
To be clear: in 'backToShape' I just want to add the column of data and reconstruct the shapefile. It so happens that the data I have appears as a factor, i.e. 20, 30, 40, etc., but the data could just as easily be continuous, and I'm sure I don't need to type in all the possibilities, but it was the only way I could seem to get it accepted. Can somebody please put me on the right track? If I'm missing a simpler way, I'd also be extremely grateful to hear a suggestion. Many thanks in advance.
Stop using the shapefiles package.
Install the sp and rgdal packages.
Read shapefile with:
chn = readOGR(".","CHN_adm1") # first arg is path, second is shapefile name w/o .shp
Now chn is like a data frame. In fact chn@data is a data frame. Do what you like to that data frame but keep it in the same order, and then you can save the updated shapefile with the new data by:
writeOGR(chn, ".", "CHN_new", driver="ESRI Shapefile")
Note you shouldn't really manipulate the chn@data data frame directly; you can work with chn like it is a data frame in many respects, for example chn$foo gets the column named foo, or chn$popden = chn$pop/chn$area would create a new column of population density if you have population and area columns.
spplot(chn, "popden")
will map by the popden column you just created, and:
head(as.data.frame(chn))
should show you the first few lines of the shapefile data.
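Putting the "keep it in the same order" advice together with the merge from the question, a minimal sketch (NAME_1 is a hypothetical key column; check the attribute table for the actual region-name field):

library(rgdal)

chn <- readOGR(".", "CHN_adm1")
incrdata <- read.csv("CHN_test.csv")

# match() keeps chn's polygon order intact, unlike merge(), which can reorder rows
idx <- match(chn$NAME_1, incrdata$NAME_1)   # NAME_1 is a hypothetical key column
chn$DataNew <- incrdata$DataNew[idx]        # DataNew as in the question's .csv

writeOGR(chn, ".", "CHN_new", driver = "ESRI Shapefile")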
