I have some large shapefiles with multiple millions of polygons that I need to dissolve. Depending on the shapefile, I either need to dissolve by group or use st_union on everything. I have been using the st_par function and it has been working great for most sf applications. However, when I use it with st_union it returns a list, and I cannot figure out how to parallelize the sf dissolve function st_union.
Any suggestions would be most helpful! Here is a small code snippet to illustrate my point.
library(sf)
library(dplyr)      # for %>% and filter() below
library(assertthat)
library(parallel)
us_shp <- "data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp"
if (!file.exists(us_shp)) {
  loc <- "https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_20m.zip"
  dest <- paste0("data/cb_2016_us_state_20m", ".zip")
  download.file(loc, dest)
  unzip(dest, exdir = "data/cb_2016_us_state_20m")
  unlink(dest)
  assert_that(file.exists(us_shp))
}
usa <- st_read("data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp", quiet = TRUE) %>%
  filter(!(STUSPS %in% c("AK", "HI", "PR")))
test <- usa %>%
  st_par(., st_union, n_cores = 2)
I think you can solve your specific problem with a small modification of the original st_par function.
However, this is just a quick and bold fix, and it might break the code for other uses of the function.
The author of the function could certainly provide a better fix...
library(parallel)

# Parallelise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...) {

  # Create a vector to split the data set up by.
  split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))

  # Perform GIS analysis
  split_results <- split(sf_df, split_vector) %>%
    mclapply(function(x) sf_func(x), mc.cores = n_cores)

  # Combine results back together. Method of combining depends on the output from the function.
  if (length(class(split_results[[1]])) > 1 | class(split_results[[1]])[1] == 'list') {
    result <- do.call("c", split_results)
    names(result) <- NULL
  } else {
    result <- do.call("rbind", split_results)
  }

  # Return result
  return(result)
}
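With that modification, one way to dissolve everything in parallel is to union each chunk and then union the chunk results. A minimal sketch, assuming the usa object from the question (note that the n_cores partial unions still need one final st_union at the end):

# Union each chunk in parallel, then combine the partial unions into one geometry
usa_union <- usa %>%
  st_par(., st_union, n_cores = 2) %>%
  st_union()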
I was trying to use this for st_join and was running into problems with the returned data type. Looking at the result more closely, it became evident that split_results was just a list of sf objects. I ended up modifying the code to use dplyr::bind_rows() to get what I wanted.
There probably needs to be some more logic around the "combine" step to deal with different return types, but this works for the st_join function.
# Parallelise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...) {

  # Create a vector to split the data set up by.
  split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))

  # Perform GIS analysis
  split_results <- split(sf_df, split_vector) %>%
    mclapply(function(x) sf_func(x, ...), mc.cores = n_cores)

  # Combine results back together. Method of combining probably depends on the
  # output from the function. For st_join it is a list of sf objects. This
  # satisfies my needs for reverse geocoding.
  result <- dplyr::bind_rows(split_results)

  # Return result
  return(result)
}
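For illustration, a hedged sketch of how this version might be called for a spatial join; points_sf and polygons_sf are hypothetical objects here, and the y argument is forwarded to st_join through the ... of st_par:

# Hypothetical usage: join polygon attributes onto points, in parallel
joined <- st_par(points_sf, st_join, n_cores = 4, y = polygons_sf)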
I am new to programming in R and to working with .shp files.
I am trying to take a subsample / subset of a .shp file that is very big. You can download the file from here: https://www.ine.es/ss/Satellite?L=es_ES&c=Page&cid=1259952026632&p=1259952026632&pagename=ProductosYServicios%2FPYSLayout (select the year 2021 and then proceed).
I have tried several things but none of them work. Simply converting it to sf is not enough either, because that would just add one more column called geometry with the coordinates listed, and that is not sufficient for me to use it later with the leaflet package.
I have tried this here but it doesn't work for me:
myspdf = readOGR(getwd(), layer = "SECC_CE_20210101")    # works
PV2 = myspdf[myspdf@data$NCA == 'País Vasco', ]          # doesn't work
PV2 = myspdf[, myspdf@data$NCA == 'País Vasco']          # doesn't work
What I intend is to create a sample of myspdf (with data, polygons, plotOrder, bbox and proj4string), but I don't want it for all the NCA values (myspdf@data$NCA); I only want those in which data$NCA is 'País Vasco'.
In short, I would like to have a sample for each value of the NCA column.
Is that possible? Can someone help me with this? Thank you very much.
I have tried this too, but the same thing as before happens: all 18 variables appear and all are empty:
Pais_V = subset(myspdf, NCA == 'País Vasco')
dim(Pais_V)
Here's one approach:
library(rgdal)

# Download a zipped shapefile, unzip it into the working directory, and read it in
dlshape <- function(shploc, shpfile) {
  temp <- tempfile()
  download.file(shploc, temp)
  unzip(temp)  # extracts into the current working directory
  shp.data <- sapply(".", function(f) {
    return(readOGR(dsn = ".", layer = shpfile))
  })
}
setwd("C:/temp")
x = dlshape(shploc="https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_aitsn_500k.zip", "cb_2020_us_aitsn_500k")
x<-x$. # extract the shapefile
mycats<-c("00","T2","T3","28")
x2<-subset(x, x$LSAD %in% mycats) # subset using the list `mycats`
mypal=colorFactor("Dark2",domain=x2$LSAD)
library(leaflet)
leaflet(x2) %>% addPolygons(weight=.2, color=mypal(x2$LSAD))
dlshape function courtesy of @yokota
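If you also want a legend keyed to the same palette, leaflet's addLegend can be chained onto the map above; a small optional addition:

leaflet(x2) %>%
  addPolygons(weight = .2, color = mypal(x2$LSAD)) %>%
  addLegend(pal = mypal, values = x2$LSAD, title = "LSAD")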
Here's another option. This uses the package sf.
myspdf <- st_read("./_data/España_Seccionado2021/SECC_CE_20210101.shp",
as_tibble = T)
Now you can filter this data any way that you filter a data frame. It will still work as spatial data, as well.
Using tidyverse (well, technically dplyr):
myspdf %>% filter(NCA == "País Vasco")
This takes it from 36,334 observations down to 1,714.
The base R method you tried to use with readOGR will work, as well.
myspdf[myspdf$NCA == "País Vasco",]
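Since the question also asks for a sample for each value of the NCA column, and sf objects support dplyr verbs, a grouped sample is one hedged way to get that (slice_sample needs dplyr >= 1.0.0):

library(dplyr)
# One random row per NCA value; raise n for larger samples per group
nca_sample <- myspdf %>%
  group_by(NCA) %>%
  slice_sample(n = 1) %>%
  ungroup()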
I am trying to loop several functions from the 'raster' package, namely, crop(), mask(), reclassify() and unstack()/as.list(). I have ten raster layers that share the same extent and data type; they correspond to land cover over 10 time-points. I want to create individual list variables for each output of crop() -> mask() -> reclassify() -> as.list(). I was able to pipe the process for 1 polygon feature, but I need to be able to loop it for each of the 10 polygon features stored in the multipolygon Shapefile, such that I can save each output list according to a specified naming convention.
Thank you and please advise. I share my code below.
EDIT: I am wondering whether a for-loop is the right way to go about this, or whether an lapply approach would be better?
# Load libraries
library(raster)   # for raster processing
library(rgdal)    # for raster/vector processing
library(sf)       # for Shapefile processing
library(magrittr) # for the %>% pipe used below
# Stack 10 rasters together
raster.stack = stack(
raster("path/raster1.tif"),
raster("path/raster2.tif"),
raster("path/raster3.tif"),
raster("path/raster4.tif"),
raster("path/raster5.tif"),
raster("path/raster6.tif"),
raster("path/raster7.tif"),
raster("path/raster8.tif"),
raster("path/raster9.tif"),
raster("path/raster10.tif")
)
# Prepare reclassification codes from 9-class raster to 3-class raster
reclasscodes = c(
0,0, # no data
1,1,
2,1,
3,1,
4,1,
5,2,
6,2,
7,3,
8,3,
9,3
)
# Convert reclass codes list into n x 2 matrix
reclassmatrix = matrix(reclasscodes, ncol=2, byrow = T)
# Load multipolygon vector Shapefile
multipolygon = shapefile("path/multipolygon.shp") # Shapefile is made of n polygons
# Example subset Shapefile to polygon_1 using attribute "ID"
polygon_1 = subset(multipolygon,ID=="D-4")
# Create output for polygon_1
list_polygon_1 =
  raster.stack %>%
  crop(y = polygon_1) %>%                 # crop to bounds
  mask(mask = polygon_1) %>%              # mask to polygon cutline
  reclassify(rcl = reclassmatrix) %>%     # reclassify to 3-class
  as.list()  # works like unstack(): converts the raster brick to a list of raster layers
# I use %>% because I do not want to save any of the intermediate outputs.
# Resultant output is a variable list for polygon_1 named 'list_polygon_1' which is exactly what I want.
# Worked perfectly.
# How do I repeat this process for polygon_1 to polygon n?
# My attempt
for (i in 1:nrow(multipolygon)) {
  raster.stack %>%
    crop(y = multipolygon[i,]) %>%
    mask(mask = multipolygon[i,]) %>%
    reclassify(rcl = reclassmatrix) %>%
    as.list() %>% # up to here these are the same steps as before for polygon_1
    # now I want to save each list output as a separate variable according to i,
    # e.g. list_polygon_2, list_polygon_3, etc.
    assign(paste(multipolygon$ID, i, sep = '_')) # assign a naming convention for each output variable
}
# Does not work. Even without the last line of code "..assign(paste(...))" there is no output variable from the as.list() line.
Here is a minimal self-contained reproducible example.
Example data
library(raster)
s <- stack(system.file("external/rlogo.grd", package="raster"))
xy1 <- xy2 <- xy3 <- matrix(c(10,17, 6,10,71,60,62,71), ncol=2)
xy2[,1] <- xy2[,1] + 30
xy3[,2] <- xy3[,2] - 30
p <- spPolygons(xy1, xy2, xy3)
#plot(r, 1)
#lines(p)
What you are after
rm <- matrix(c(0,100,0,  100,150,2,  150,255,3), ncol = 3, byrow = TRUE)
out <- list()
for (i in 1:length(p)) {
  x <- crop(s, p[i,])
  x <- mask(x, p[i,])
  out[[i]] <- reclassify(x, rm)
}
What you are saying about unstack does not make sense (and unlist does not work). I would advise against it, but you could do
out2 <- lapply(out, unstack)
I am not sure what you are really after. If you want the cell values you can make it much simpler (no need for a loop) and do
r <- reclassify(s, rm)
e <- extract(r, p)
To your question about lapply versus a loop: in terms of performance, that rarely matters. lapply can be concise, but in cases like this writing a loop is better, as it is easier to read and write, especially if you do not use %>%.
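For comparison, here is the same loop written with lapply, using the s, p, and rm objects from the example above:

# lapply equivalent of the crop/mask/reclassify loop
out_lapply <- lapply(1:length(p), function(i) {
  x <- crop(s, p[i,])
  x <- mask(x, p[i,])
  reclassify(x, rm)
})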
I'm a newbie in R and I've seen several posts about downloading multiple stocks, but for one reason or another they don't work as suggested.
My purpose is to download a vector of stocks and create a single xts matrix containing only the Close prices for every stock (so n observations x 3 columns).
Anyway, I'd like to start from a basic script that doesn't work properly:
library(quantmod)
ticker=c("KO","AAPL","^GSPC")
for (i in 1:length(ticker)) {
  simbol = as.xts(na.omit(getSymbols(ticker[i], from = "2016-01-01", auto.assign = F)))
  new = Cl(simbol)
  merge(new[i])
}
It would be even better to write a function(symbols) that I can call whenever I need it, just changing the names of the stocks to download.
Thanks to everyone.
This is how I would do what you want with a function wrapper (which is a pretty common kind of manipulation with xts):
ticker <- c("KO", "AAPL", "^GSPC")

collect_close_series <- function(ticker) {
  # Preallocate a list to store the result from each loop iteration
  # (Note: lapply is another alternative to a direct loop)
  lst <- vector("list", length(ticker))
  for (i in 1:length(ticker)) {
    symbol <- na.omit(getSymbols(ticker[i], from = "2016-01-01", auto.assign = FALSE))
    lst[[i]] <- Cl(symbol)
  }
  # You now have a list of close prices. You can combine the objects in the
  # list compactly using do.call; this is a common data-manipulation pattern
  # with xts objects.
  rr <- do.call(what = merge, args = lst)
  rr
}

out <- collect_close_series(ticker)
More advanced (better code design): You could write cleaner code by writing a function that handles each symbol (rather than a function that wraps and passes in all the symbols together) and then run lapply on it:
per_sym_close <- function(tick) {
  symbol <- na.omit(getSymbols(tick, from = "2016-01-01", auto.assign = FALSE))
  Cl(symbol)
}

out2 <- do.call(merge, lapply(X = ticker, FUN = per_sym_close))
This gives the same result.
Hope this helps get you started toward writing good R code!
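One optional final step, since the goal is a plain n x 3 matrix of close prices: the merged columns come back with names like KO.Close, so you may want to replace them with the ticker symbols themselves:

# Optional: use the ticker symbols as column names
colnames(out2) <- ticker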
Question
I am trying to perform a number of polygon clips using the gIntersection function with R in a loop. I can obtain the correct clips and re-enter data manually (so I can turn the resulting SpatialPolygons object back into a SpatialPolygonsDataFrame object). What I can't do is get this working in a loop with for() or apply().
At the moment this isn't a problem. I have 9 English regions (with London), so it's not a huge challenge to set each clip up manually. But, I want to eventually clip LSOAs in LADs, which essentially means setting up >400 clips.
So, my question is, how do I turn my manual clips into a working loop?
Minimal Reproducible Example
To keep things simple, let's use the English regions (n = 9). For each of the 9 regions, I'm going to clip the counties. The following code loads the appropriate shapefiles and reprojects them as British National Grid:
require(rgdal)
require(rgeos)
# English counties shapefile (~ 10MB zipped)
download.file(
"https://dl.dropboxusercontent.com/s/6o0mi28pjo1kh9k/england-counties.zip",
"ec", method = "wget")
unzip("ec")
ec <- readOGR("england-counties", "england_ct_2011")
proj4string(ec) <- CRS("+init=epsg:27700")
# English regions (~6MB zipped)
download.file(
"https://dl.dropboxusercontent.com/s/p69m0vk2fh4xe3o/england-regions-2011.zip",
"er", method = "wget")
unzip("er")
er <- readOGR("england-regions-2011", "England_gor_2011")
proj4string(er) <- CRS("+init=epsg:27700")
You should be left with two objects, er (English regions) and ec (English counties). Both are SpatialPolygonsDataFrame objects.
Taking the first region - East of England E12000006 - let's clip the counties and turn the result back in to a SpatialPolygonsDataFrame object:
ee <- gIntersection(ec, er[er$CODE == "E12000006", ],
                    byid = T, drop_not_poly = T)
row.names(ee) <- as.character(gsub(" 0", "", row.names(ee)))
# gIntersection adds ' 0' to each row.name?
ee <- SpatialPolygonsDataFrame(ee, ec@data[row.names(ee), ])
A plot of ee confirms this worked.
As you can see, this is a nice workflow for just a few shapes, but I really want to loop through all regions and, ultimately, many more polygons.
What I've Tried
I'm not very good with apply() loops, so what I've tried so far is a for() loop (which I know is relatively slow, but still quicker than typing everything out!):
regions <- as.character(er$CODE) # length = 9 as expected
for(i in 1:length(regions)){
as.name(paste0(regions[i], "c")) <-
gIntersection(ec, er[er$CODE == regions[1], ], byid = T, drop_not_poly = T)
}
Rather than the expected behaviour I get the following error:
Error in as.name(paste0(regions[1], "c")) <- gIntersection(ec, er[er$CODE == :
could not find function "as.name<-"
I also tried wrapping the object name in an eval() but get the following error:
Error in eval(as.name(paste0(regions[1], "c"))) <- gIntersection(ec, er[er$CODE == :
could not find function "eval<-"
What am I missing?
In addition to the gIntersection, I would like to re-create a SpatialPolygonsDataFrame object if possible. I've tried the following code, having done one gIntersection manually, but again it doesn't work:
for (i in 1:length(regions)) {
  row.names(as.name(paste0(regions[i], "c"))) <- as.character(gsub(" 0", "",
    row.names(as.name(paste0(regions[i], "c")))))
}
I get the following error:
Error in `rownames<-`(x, value) :
attempt to set 'rownames' on an object with no dimensions
I'm also not sure how to increment the " 0", as this increases by one for each new region (" 1", " 2", etc.)
Again, setting the first example up manually I also can't perform the final SpatialPolygonsDataFrame step:
for (i in 1:length(regions)) {
  as.name(regions[i]) <- SpatialPolygonsDataFrame(regions[i],
    ec@data[row.names(regions[i], )])
}
For this I get the following error:
Error in stopifnot(length(Sr#polygons) == nrow(data)) :
trying to get slot "polygons" from an object of a basic class ("character") with no
slots
Where I've looked
The following SO examples are related but do not seem to help, or at least I can't see how I would make them apply to this example:
rgeos gIntersection in loop takes too long to clip path network
How to clip WorldMap with polygon in R?
https://gis.stackexchange.com/questions/33278/no-intersection-found-between-polygons-i-know-intersect
Thanks for taking the time to read this.
Does this help?
ee <- lapply(regions, function(x)
gIntersection(ec, er[er$CODE == x, ], byid = TRUE, drop_not_poly = TRUE))
This gives you a list of clipped SpatialPolygons objects, one for each region, which you can access in the usual way, e.g.
ee[[1]]
plot(ee[[1]]) # to plot the first region with counties
Edit
Your original code should work with a slight modification (see below).
res <- list()
for (i in 1:length(regions)) {
  ee <- gIntersection(ec, er[er$CODE == regions[i], ],
                      byid = TRUE, drop_not_poly = TRUE)
  row.names(ee) <- as.character(gsub(paste0(" ", i-1), "", row.names(ee)))
  ee <- SpatialPolygonsDataFrame(ee, ec@data[row.names(ee), ])
  res[[i]] <- ee
}
If that solves the problem, then the issue was that the row names of ee incremented by one for each region, and you did not account for this.
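As a small follow-up, naming the list elements by region code makes lookups easier; a minimal addition, assuming the regions vector from the question:

names(res) <- regions
plot(res[["E12000006"]]) # counties clipped to the East of England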
I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I have so far (the object ast14dmo.sd representing a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
  if (i == 1) {
    # Merge current with subsequent scene
    ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
  } else if (i > 1 && i < length(ast14dmo.sd)) {
    tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
    ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
  } else {
    # Save merged image
    writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""),
                format = "GTiff", overwrite = TRUE)
  }
}
As you surely guess, the code works. However, merging takes quite long, considering that each single raster object is about 70 MB. I also tried Reduce and do.call, but that failed since I couldn't pass the argument tolerance, which circumvents the different origins of the raster files.
Anybody got an idea of how to speed things up?
You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here it is with some data, from the example in ?raster::merge:
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)
The merge function from the raster package is a little slow. For large projects a faster option is to work with GDAL commands in R.
library(gdalUtils)
library(rgdal)
Build list of all raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this as a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile = all_my_rasts, dst_dataset = "MyBigNastyRasty.tif", of = "GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.
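For instance, a hedged sketch of the VRT route using the gdalUtils wrappers; the output file names here are illustrative:

# Build a lightweight virtual mosaic first, then materialise it as a GeoTIFF
gdalbuildvrt(gdalfile = all_my_rasts, output.vrt = "mosaic.vrt")
gdal_translate(src_dataset = "mosaic.vrt", dst_dataset = "MyBigMosaic.tif", of = "GTiff")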
You can use Reduce like this, for example:
Reduce(function(...) merge(..., tolerance = 1), ast14dmo.sd)
The SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to the SAGA .sgrd format first, then run the command-line tool saga_cmd.
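As a rough illustration (parameter names taken from the grid_tools 3 documentation linked above; the file names are placeholders), the call from R might look like this:

# Hedged sketch: mosaic two pre-translated .sgrd grids via the SAGA command line
system2("saga_cmd", args = c("grid_tools", "3",
                             "-GRIDS=r1.sgrd;r2.sgrd",
                             "-TARGET_OUT_GRID=mosaic.sgrd"))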
I have tested the solution using gdalUtils, as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking the documentation of the mosaic_rasters function here, I found that it works without making a template raster before mosaicking the images. I pasted the example code from the documentation below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if (require(raster) && require(rgdal) && valid_install) {
  layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package = "gdalUtils")
  layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package = "gdalUtils")
  mosaic_rasters(gdalfile = c(layer1, layer2),
                 dst_dataset = file.path(outdir, "test_mosaic.envi"),
                 separate = TRUE, of = "ENVI", verbose = TRUE)
  gdalinfo("test_mosaic.envi")
}
I was faced with this same problem, and this is what I used:
# Read desired files into R
data_name1 <- 'file_name1.tif'
r1 <- raster(data_name1)
data_name2 <- 'file_name2.tif'
r2 <- raster(data_name2)

# Merge files
new_data <- raster::merge(r1, r2)
Although it did not produce a new merged raster file on disk, the result was stored in the environment and produced a merged map when plotted.
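If you do want the merged result written to disk, raster::merge forwards writeRaster arguments, so a filename can be passed directly (output name illustrative):

# Write the merged raster to disk while merging
new_data <- raster::merge(r1, r2, filename = "merged_output.tif", overwrite = TRUE)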
I ran into the following problem when trying to mosaic several rasters on top of each other:
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As @Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first:
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
filename = file.path("Mosaic_raster.tif"),
format = "GTiff",
overwrite = TRUE)