Efficient spatial joining for large datasets in R

I am working with rather large data.frames that I often need to join spatially. The fastest way I have come up with so far is this method:
library(rgdal)
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")
# read eco region shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land", verbose = FALSE)
## Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by=0.05)
lat <- seq(48.025, 66.025, by=0.05)
grd <- expand.grid(lon=lon, lat=lat)
# Get the Ecoregion for each position
pings <- SpatialPoints(grd[c('lon','lat')], proj4string=ices_eco@proj4string)
grd$area <- over(pings,ices_eco)$Ecoregion
But this takes a very long time and uses a lot of RAM, and it will sometimes fail with "Error: cannot allocate vector of size 460 Kb" (if you can't reproduce the error, just make grd larger). Can anyone come up with a better, faster, or more efficient solution?
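One way to keep peak memory bounded is to run the lookup in row chunks and concatenate the results. This is only a sketch, and it assumes the allocation failure comes from handling all points in a single over() call. The helper name chunked_apply and the chunk size are hypothetical and not part of sp or rgdal.

```r
# Hypothetical helper: apply `fun` to successive row chunks of `df`
# and concatenate the per-chunk results into one vector.
chunked_apply <- function(df, chunk_size, fun) {
  if (nrow(df) == 0) return(NULL)
  starts <- seq(1, nrow(df), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, nrow(df))
    fun(df[idx, , drop = FALSE])
  })
  unlist(pieces, use.names = FALSE)
}

# Toy check that row order is preserved across chunks
toy <- data.frame(x = 1:10)
doubled <- chunked_apply(toy, 3, function(d) d$x * 2)
print(doubled)
```

With the objects from the question, `fun` could wrap the existing lookup, for example function(d) as.character(over(SpatialPoints(d[c('lon','lat')], proj4string = ices_eco@proj4string), ices_eco)$Ecoregion). Whether this is faster overall is untested; it only limits how much is held in memory at once.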

Related

How to streamline and speed up loop with getData package in R

I am trying to download high-resolution climate data for a bunch of lat/long coordinates, and combine them into a single dataframe. I've come up with a solution (below), but it will take forever with the large list of coordinates I have. I asked a related question on the GIS StackExchange to see if anyone knew of a better approach for downloading and merging the data, but I'm wondering if I could somehow just speed up the operation of the loop? Does anyone have any suggestions on how I might do that? Here is a reproducible example:
# Download and merge 0.5 minute MAT/MAP data from WorldClim for a list of lon/lat coordinates
# This is based on https://emilypiche.github.io/BIO381/raster.html
# Make a dataframe with coordinates
coords <- data.frame(Lon = c(-83.63, 149.12), Lat=c(10.39,-35.31))
# Load package
library(raster)
# Make an empty dataframe for dumping data into
coords3 <- data.frame(Lon=integer(), Lat=integer(), MAT_10=integer(), MAP_MM=integer())
# Get WorldClim data for all the coordinates, and dump into coords 3
for(i in seq_along(coords$Lon)) {
r <- getData("worldclim", var="bio", res=0.5, lon=coords[i,1], lat=coords[i,2]) # Download the tile containing the lat/lon
r <- r[[c(1,12)]] # Reduce the layers in the RasterStack to just the variables we want to look at (MAT*10 and MAP_mm)
names(r) <- c("MAT_10", "MAP_mm") # Rename the columns to something intelligible
points <- SpatialPoints(na.omit(coords[i,1:2]), proj4string = r@crs) #give lon,lat to SpatialPoints
values <- extract(r,points)
coords2 <- cbind.data.frame(coords[i,1:2],values)
coords3 <- rbind(coords3, coords2)
}
# Convert MAT*10 from WorldClim into MAT in Celsius
coords3$MAT_C <- coords3$MAT_10/10
Edit: Thanks to advice from Dave2e, I've first made a list, then put intermediate results in the list, and rbind it at the end. I haven't timed this yet to see how much faster it is than my original solution. If anyone has further suggestions on how to improve the speed, I'm all ears! Here is the new version:
coordsList <- list()
for(i in seq_along(coordinates$lon_stm)) {
r <- getData("worldclim", var="bio", res=0.5, lon=coordinates[i,7], lat=coordinates[i,6]) # Download the tile containing the lat/lon
r <- r[[c(1,12)]] # Reduce the layers in the RasterStack to just the variables we want to look at (MAT*10 and MAP_mm)
names(r) <- c("MAT_10", "MAP_mm") # Rename the columns to something intelligible
points <- SpatialPoints(na.omit(coordinates[i,7:6]), proj4string = r@crs) #give lon,lat to SpatialPoints
values <- extract(r,points)
coordsList[[i]] <- cbind.data.frame(coordinates[i,7:6],values)
}
coords_new <- dplyr::bind_rows(coordsList) # bind_rows() comes from the dplyr package
Edit2: I used system.time() to time the execution of both of the above approaches. When I did the timing, I had already downloaded all of the data, so the download time isn't included in my time estimates. My first approach took 45.01 minutes, and the revised approach took 44.15 minutes, so I'm not really seeing a substantial time savings by doing it the latter way. Still open to advice on how to revise the code so I can improve the speed of the operations!
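Since the timings above exclude downloads, much of the time is probably spent reloading a tile and calling extract() once per point. A possible restructuring, sketched below, groups the coordinates by tile first so that each tile is loaded once and extract() runs once per group. The helper tile_key is hypothetical, and the tiling rule it encodes (30 x 30 degree blocks between latitudes 90 and -60) is an assumption about the res = 0.5 tiles that should be checked against the raster documentation.

```r
# Hypothetical helper: label each point with the 30-degree tile it falls in.
# ASSUMPTION: res = 0.5 WorldClim tiles form a 12 x 5 grid of 30 x 30 degree
# blocks between latitudes 90 and -60; verify this for your raster version.
tile_key <- function(lon, lat) {
  lon <- pmin(pmax(lon, -180), 179.999)
  lat <- pmin(pmax(lat, -59.999), 90)
  paste(floor((lon + 180) / 30), floor((90 - lat) / 30), sep = "_")
}

# Toy coordinates: the first two are from the question, the third is made up
coords <- data.frame(Lon = c(-83.63, 149.12, -84.10),
                     Lat = c(10.39, -35.31, 9.90))
coords$tile <- tile_key(coords$Lon, coords$Lat)

# One data frame per tile; each group needs only one getData()/extract() call
by_tile <- split(coords, coords$tile)
print(names(by_tile))
```

Each element of by_tile could then be processed with a single getData() call (using the first row's Lon and Lat) followed by one extract() on a two-column matrix of that group's coordinates, with the per-tile results bound together once at the end.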

Custom spatial processing function consumes a lot of memory as code runs in R

I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will
1. Subset the rasters to potatoes and tomatoes
2. Change the raster subsets to polygons, one for potatoes and one for tomatoes
3. Create centroids from each polygon
4. Create a SpatialPointsDataFrame based on the centroids
5. Extract the area measurement for each area of interest with SpatialPointsDataFrame
6. Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)
dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]
# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
r1 = raster(ncol = 1000, nrow = 1000)
r1[] = sample(1:60, 1000000, replace = T)
proj4string(r1) = CRS(ae_pr)
return(r1)
}
rlis <- lapply(1:5, rmk1)
#Pixel values needed
ptto <- c(43, 54)
# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
# Next line is not necessary with the raster list created above.
# temras = raster(x)
now = format(Sys.time(), "%b%d%H%M%S")
nwnm = paste0(names(x), now)
rasmatx = match(x = x, table = ptto)
writeRaster(rasmatx, file.path( dat_dir2, paste0(nwnm,"ras")), format = "GTiff")
tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4}, dissolve = T)
tempol2 = disaggregate(tempol)
# for potatoes
tempol2p = tempol2[tempol2$layer == '1',]
if (nrow(tempol2p) > 0) {
temcenp = gCentroid(tempol2p, byid = T)
temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp) , temcenp))
temcenpdf$pot_p = extract(rasmatx, temcenpdf)
temcenpdf$areap_m = gArea(tempol2p, byid = T)
# writeOGR(temcenpdf, dsn=file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
}
# for tomatoes
tempol2t = tempol2[tempol2$layer == '2',]
if (nrow(tempol2t) > 0) {
temcent = gCentroid(tempol2t, byid = T)
temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent) , temcent))
temcentdf$tom_t = extract(rasmatx, temcentdf)
temcentdf$areat_m = gArea(tempol2t, byid = T)
writeOGR(temcentdf, dsn=file.path(dat_dir2), paste0(nwnm,"t"), driver = "ESRI Shapefile")
}
}
lapply(rlis, pottom)
I know I should provide some toy data and I created some, but I don't know if they exactly recreate my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem. A lot of memory is used when this code runs. The tiflist can only get through the first 4 files of the list and by then RAM, which is 16 GB on my laptop, is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster is staying open after being connected to in the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that, there was no output at all from the function after 10 minutes and by then, I've usually got the first 3 rasters processed.
27/Oct EDIT after comments from RobertHijmans. It seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally to work with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow. I was unaware of this issue. I waited 30 minutes before killing the process with no result in one of my attempts. It works on the conditions I require for rasters for Alabama and Arkansas for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know if it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really figure out the output. And it hung with the process too.
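If per-pixel centroids are acceptable instead of centroids of dissolved patches (an assumption, since it changes the output described above), the polygon step can be skipped entirely: the centre of each matching cell follows from the raster origin and resolution. The following is a base R sketch on a plain matrix; the function name pixel_centres and its arguments are hypothetical.

```r
# Hypothetical helper: centre coordinates and area of every cell whose value
# is in `values`. `m` is a matrix with row 1 at the top (northern) edge,
# `xmin`/`ymax` give the upper-left corner, `res` is the square cell size.
pixel_centres <- function(m, xmin, ymax, res, values) {
  hit <- which(matrix(m %in% values, nrow = nrow(m)), arr.ind = TRUE)
  data.frame(
    x     = xmin + (hit[, "col"] - 0.5) * res,
    y     = ymax - (hit[, "row"] - 0.5) * res,
    value = m[hit],
    area  = rep(res^2, nrow(hit))
  )
}

# Toy 2 x 3 grid with 30 m cells; 43 = potatoes, 54 = tomatoes
m <- matrix(c(43,  1, 54,
               2, 43,  3), nrow = 2, byrow = TRUE)
centres <- pixel_centres(m, xmin = 0, ymax = 60, res = 30, values = c(43, 54))
print(centres)
```

With the raster package, something like xyFromCell(r, which(values(r) %in% ptto)) should give the same coordinates, at the cost of reading all cell values into memory at once.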

Rasterizing and mosaicking a large (~15GB) CSV in parallel

I'm trying to rasterize a large CSV file which contains X and Y coordinates of raster pixels and an ID representing different individuals that have moved through the pixel.
I want 2 rasters from this operation: In the first, each pixel of the raster contains the number of times individuals have crossed it and in the second, each pixel contains the number of unique individuals that have crossed it.
I currently create rasters for each ID and then mosaic the rasters. I've parallelized this operation, using 16 cores and a CPU with 200 GB RAM. However, the CSV contains over a million rows and it seems a large overhead to copy it into each core. It also takes upwards of 4 days to complete and I have multiple such CSV files.
Is there a better, faster way to do this?
This is what I do currently:
n.cores <- 16 #this is half of the available cores
#Register CoreCluster
cl <- makeCluster(n.cores)
registerDoSNOW(cl)
clusterExport(cl=cl, "fil2" ) #fil2 is the CSV
#uq_id is a vector of all unique IDs
rasterList= foreach(i=1:length(uq_id), .combine='comb',
.init=list()) %dopar% {
library(raster)
#create a raster for each unique ID
dfXYID <- subset(fil2, uqid %in% uq_id[i]) # select rows for the i-th unique ID, not the loop index itself
dfXYID$uqid <- as.numeric(dfXYID$uqid)
dfXYID=as.data.frame(dfXYID)
x <- raster(xmn=min(dfXYID$V1), xmx=max(dfXYID$V1), ymn=min(dfXYID$V2), ymx=max(dfXYID$V2), res=220, crs="+proj=utm +zone=43 +datum=WGS84 +units=m +no_defs")
#create a raster of the number of times individual with ID i have moved through a pixel
lenList <- rasterize(dfXYID[, c('V1', 'V2')], x, dfXYID[, 'uqid'], fun=function(x,...){length((na.omit(x)))})
return(lenList)
}
rasterList is then a list of rasters, each representing the movement of individual i. I then have to do another time-consuming operation (can I do this in parallel within the above foreach?): apply a mosaic operation on this large list to get the 2 raster outputs that I need.
I would be very grateful if anyone has any suggestions on how I can speed this up.
I understand gdal is an option, but I am constrained as I have to work with a CSV and I don't want to be writing each raster in rasterList to the disk to do further processing with gdal.
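Both target rasters are per-pixel counts, so they can in principle be computed in one pass over the table without building one raster per ID: total visits are the number of rows per cell, and unique individuals are the number of distinct (cell, ID) pairs per cell. The sketch below uses base R only; the function name pixel_counts is hypothetical, and it assumes a single grid anchored at the minimum X and maximum Y of the whole table, which may differ slightly from the per-ID extents used above.

```r
# Hypothetical helper: count visits and distinct individuals per grid cell.
# x, y: coordinates; id: individual identifier; res: cell size in map units.
pixel_counts <- function(x, y, id, res) {
  col <- floor((x - min(x)) / res) + 1
  row <- floor((max(y) - y) / res) + 1   # row 1 is the northern edge
  n_col <- max(col)
  n_row <- max(row)
  n_cell <- n_row * n_col
  cell <- (row - 1) * n_col + col        # row-major cell number
  first <- !duplicated(paste(cell, id))  # first visit of each ID to each cell
  list(
    visits      = matrix(tabulate(cell, nbins = n_cell),
                         n_row, n_col, byrow = TRUE),
    individuals = matrix(tabulate(cell[first], nbins = n_cell),
                         n_row, n_col, byrow = TRUE)
  )
}

# Toy data: three individuals on a 2 x 2 grid of 220 m cells
res <- pixel_counts(x  = c(0, 0, 220, 220, 0),
                    y  = c(0, 0, 0, 220, 0),
                    id = c(1, 1, 2, 2, 3),
                    res = 220)
print(res$visits)
print(res$individuals)
```

For a 15 GB file, the same grouping would more likely be done with a package built for large tables; the resulting matrices can then be turned into rasters with raster() and an explicit extent.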

efficient use of raster functions in r

I have 500+ points in a SpatialPointsDataFrame object; I have a 1.7GB (200,000 rows x 200,000 cols) raster object. I want to have a tabulation of the values of the raster cells within a buffer around each of the 500+ points.
I have managed to achieve that with the code below (I got a lot of inspiration from here.). However, it is slow to run and I would like to make it run faster. It actually runs OK for buffers with "small" widths, say 5km or even 15km (~1 million cells), but it becomes super slow when the buffer increases to, say, 100km (~42 million cells).
I could easily improve on the loop below by using something from the apply family and/or a parallel loop. But my suspicion is that it is slow because the raster package writes 400Mb+ temporary files for each iteration of the loop.
# packages
library(rgeos)
library(raster)
library(rgdal)
myPoints = readOGR(points_path, 'myLayer')
myRaster = raster(raster_path)
myFunction = function(polygon_obj, raster_obj) {
# this function return a tabulation of the values of raster cells
# inside a polygon (buffer)
# crop to extent of polygon
clip1 = crop(raster_obj, extent(polygon_obj))
# crops to polygon edge & converts to raster
clip2 = rasterize(polygon_obj, clip1, mask = TRUE)
# much faster than extract
ext = getValues(clip2)
# tabulates the values of the raster in the polygon
tab = table(ext)
return(tab)
}
# loop over the points
ids = unique(myPoints$ID)
for (id in ids) {
# select point
myPoint = myPoints[myPoints$ID == id, ]
# create buffer
myPolygon = gBuffer(spgeom = myPoint, byid = FALSE, width = myWidth)
# extract the data I want (projections, etc are fine)
tab = myFunction(myPolygon, myRaster)
# do stuff with tab ...
}
My questions:
Am I right to partially blame the writing operations? If I managed to avoid all those writing operations, would this code run faster? I have access to a machine with 32GB of RAM, so I guess it is safe to assume I could load the raster into memory and would not need to write temporary files?
What else could I do to improve efficiency in this code?
I think you should approach it like this
library(raster)
library(rgdal)
myPoints <- readOGR(points_path, 'myLayer')
myRaster <- raster(raster_path)
e <- extract(myRaster, myPoints, buffer=myWidth)
And then something like
etab <- sapply(e, table)
It is hard to answer your question #1 as we do not know enough about your data (we do not know how many cells are covered by a "100 km" buffer). But you can set options about when to write to file with the rasterOptions function. You notice that getValues is faster than extract, based on the post you link to, but I think that is wrong, or at least not very important. The combination of crop, rasterize and getValues should have a similar performance as extract (which does almost exactly that under the hood). If you go this route anyway, you should pass an empty RasterLayer, created by raster(myRaster) for faster cropping.
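One detail about sapply(e, table): if the buffers contain different sets of raster values, the tables have different lengths and sapply returns a list rather than a matrix. A small sketch of one way around this, using a made-up list in place of the real extract() result, is to tabulate against a common set of levels:

```r
# Stand-in for the list returned by extract(myRaster, myPoints, buffer = myWidth):
# one vector of cell values per point (toy values, NA = cell without data)
e <- list(c(1, 1, 2, NA), c(2, 3, 3, 3))

# Tabulate every buffer against the same set of classes so the results align
classes <- sort(unique(unlist(e)))
etab <- sapply(e, function(v) table(factor(v, levels = classes)))
print(etab)
```

Here etab has one row per raster value and one column per point, with zeros where a value does not occur in a buffer.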

writing a loop for upscaling precipitation for USA

I am writing code to calculate the mean amount of precipitation for different regions of the conterminous USA. My data covers 300 by 120 (lon by lat) grid cells in NetCDF format. I want to write a loop in R that takes the average of each 10 by 10 block of grid cells, assigns that average to all of the cells inside the block, and repeats this for the next block. At the end, instead of a 120 by 300 grid, I will have a 12 by 30 grid. So this is a kind of upscaling method I want to apply to my data. I could use a separate for-loop for each region, but that would make my code huge and I don't want to do that. Any ideas would be appreciated. Thanks.
P.S.: Here is the function I have written for one region (10 by 10, lat by lon).
upscaling <- function(file, variable, start.time=1, count.time=1)
{
library(ncdf) # load ncdf library to manipulate ncdf data
ncdata <- open.ncdf(file); # open ncdf file
lon <- get.var.ncdf(ncdata, "lon");
lat <- get.var.ncdf(ncdata, "lat");
time <- get.var.ncdf(ncdata, "time");
start.lon <- 1
end.lon <- length(lon)
start.lat <- 1
end.lat <- length(lat)
count.lon <- end.lon - start.lon + 1; # count number of longitude
count.lat <- end.lat - start.lat + 1; # count number of latitude
dat <- get.var.ncdf(ncdata, variable, start=c(start.lon, start.lat, 1),
count=c(count.lon, count.lat, 1))
temp.data<- array(0,dim=c(10,10))
for (i in 1:10)
{
for (j in 1:10)
{
temp.data <- mean(dat[i,j,])
}
}
}
There is no need to make a messy loop to spatially aggregate your data. Just use the aggregate function in the raster package:
library(raster)
a=matrix(data=c(1:100),nrow=10,ncol=10)
a=raster(a)
ra <- aggregate(a, fact=5, fun=mean) #fact=5 will aggregate using a 5x5 window
ra=as.matrix(ra)
ra
Now for your netcdf data, use raster's rasterFromXYZ to create the raster that can then be aggregated with the above method. Bonus includes the option to define your projection as an argument in the function so you end up with a georeferenced object at the end. This is important because if you aggregate your data without it you will then have to figure out by hand how to georeference the resulting matrix.
EDIT: If you want a resulting raster with the same dimensions as the original one, disaggregate the data right after aggregating it. While this seems redundant, these raster methods are very fast.
library(raster)
a=matrix(data=c(1:100),nrow=10,ncol=10)
a=raster(a)
ra <- aggregate(a, fact=5, fun=mean) #fact=5 will aggregate using a 5x5 window
ra <- disaggregate(ra, fact=5)
ra=as.matrix(ra)
ra
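For comparison, the same block averaging can be sketched in base R on a plain matrix, which may help if the data are already in memory as an array. This is an unweighted mean that treats every cell as equal in area; the function name block_mean is hypothetical.

```r
# Hypothetical helper: mean of each k x k block of matrix `m`.
# Returns the coarse matrix and a full-size matrix in which every cell
# holds the mean of the block it belongs to.
block_mean <- function(m, k) {
  stopifnot(nrow(m) %% k == 0, ncol(m) %% k == 0)
  rg <- as.vector((row(m) - 1) %/% k + 1)   # block row of each cell
  cg <- as.vector((col(m) - 1) %/% k + 1)   # block column of each cell
  coarse <- tapply(as.vector(m), list(rg, cg), mean, na.rm = TRUE)
  full <- matrix(coarse[cbind(rg, cg)], nrow = nrow(m))
  list(coarse = coarse, full = full)
}

m <- matrix(1:16, nrow = 4)
res <- block_mean(m, 2)
print(res$coarse)
print(res$full)
```

For the 120 by 300 grid in the question, k = 10 would give the 12 by 30 coarse matrix and the full-size matrix in one call.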
If your grid definitions follow standard netcdf conventions, then you might be able to remap using the CDO remapping functions. For first order conservative remapping you can try
cdo remapcon,grid_specification_here in.nc out.nc
Note that the answer given above is approximate, and not quite correct as the grid cell size is not the same as a function of latitude. The size of the error is likely small for this particular task as the cell sizes are fine, but nevertheless the answer will be slightly off.
