Improving performance when working with geodata in R - r

I wrote the following script to produce the raw data for plotting the above map. The problem is, for 550,000 data points, this takes about 2 hours to run on a relatively powerful machine. I'm new to R, however, and I'm wondering if there are any optimized functions I can take advantage of?
The basic idea is that, given a set of geospatial data, you split the set into 200 rows, and split each row into a bunch of squares. You then calculate the total of a value in each square in a row. The approach I've taken below is to take the "upper left" point of a square, calculate the latitude/longitude of edges of the square, and exclude all points not in those bounds, and then sum what remains. Is there a better way without using a solution like PostGIS?
all.data <- read.csv("FrederictonPropertyTaxDiffCleanedv3.csv", header=TRUE,
stringsAsFactors=FALSE)
all.data$X <- as.numeric(all.data$X)
all.data$Y <- as.numeric(all.data$Y)
startEnd <- function(lats, lngs) {
# Find the "upper left" (NW) and "bottom right" (SE) coordinates of a set of data.
#
# Args:
# lats: A list of latitude coordinates
# lngs: A list of longitude coordinates
#
# Returns:
# A list of values corresponding to the northwest-most and southeast-most coordinates
# Convert to real number and remove NA values
lats <- na.omit(as.numeric(lats))
lngs <- na.omit(as.numeric(lngs))
topLat <- max(lats)
topLng <- min(lngs)
botLat <- min(lats)
botLng <- max(lngs)
return(c(topLat, topLng, botLat, botLng))
}
startEndVals <- startEnd(all.data$Y, all.data$X)
startLat <- startEndVals[1]
endLat <- startEndVals[3]
startLng <- startEndVals[2]
endLng <- startEndVals[4]
num_intervals = 200.0
interval <- (startEndVals[1] - startEndVals[3]) / num_intervals
# testLng <- -66.6462379307115
# testLat <- 45.9581234392
# Prepare the data to be sent in
data <- all.data[,c("Y", "X", "levy2014_ha")]
sumInsideSquare <- function(pointLat, pointLng, interval, data) {
# Sum all the values that fall within a square on a map given a point,
# an interval of the map, and data that contains lat, lng and the values
# of interest
colnames(data) <- c("lat", "lng", "value")
# Data east of point
data <- data[data$lng > pointLng,]
# Data west of point + interval
data <- data[data$lng < pointLng + interval,]
# Data north of point + interval (down)
data <- data[data$lat > pointLat - interval,]
# Data south of point
data <- data[data$lat < pointLat, ]
# Clean remaining data
data <- na.omit(data)
return(sum(data$value))
}
# Debugging
# squareSumTemp <- sumInsideSquare(testLat, testLng, interval, data)
# Given a start longitude and an end longitude, calculate an array of values
# corresponding to the sums for that latitude
calcSumLat <- function(startLng, endLng, lat, interval, data) {
row <- c()
lng <- startLng
while (lng < endLng) {
row <- c(row, sumInsideSquare(lat, lng, interval, data))
lng <- lng + interval
}
return(row)
}
# Debugging
# rowTemp <- calcSumLat(startLng, endLng, testLat, interval, data)
# write.csv(rowTemp, file = "Temp.csv", row.names = FALSE)
# Get each line of data to plot
lat <- startLat
rowCount <- 1
all.sums <- list()
while (lat > endLat) {
col <- calcSumLat(startLng, endLng, lat, interval, data)
all.sums[[as.character(rowCount)]] <- col
lat <- lat - interval
rowCount <- rowCount + 1
}
# Convert to data frame
all.sums.frame <- data.frame(all.sums)
# Save to disk so I don't have to run it again
write.csv(all.sums.frame, file = "Levy2014Sums200.csv", row.names = FALSE)

Ended up finding a solution to this myself. The key to it was using the foreach package with the doParallel package so it could take advantage of all the cores on my computer. There is a great guide on it here: http://www.r-bloggers.com/a-brief-foray-into-parallel-processing-with-r/
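Separately from the parallel approach above, the per-square filtering can be replaced by computing each point's grid index once and summing by index. The following is only a sketch of that alternative, not the code the answer used: the toy data, the column names lat, lng and value, and the helper name binSums are all assumptions made for illustration. It also differs slightly from the question's script at the edges, since points lying exactly on a square boundary are assigned to one square instead of being dropped.

```r
# Toy stand-in for the real data; the column names lat, lng and value are assumptions.
set.seed(42)
pts <- data.frame(
  lat   = runif(5000, 45.90, 46.00),
  lng   = runif(5000, -66.80, -66.60),
  value = runif(5000, 0, 100)
)

# Hypothetical helper: sum `value` per grid square in one vectorized pass.
binSums <- function(pts, num_intervals = 200) {
  pts <- na.omit(pts)
  top_lat  <- max(pts$lat)   # northern edge
  left_lng <- min(pts$lng)   # western edge
  interval <- (top_lat - min(pts$lat)) / num_intervals
  n_cols   <- ceiling((max(pts$lng) - left_lng) / interval)

  # Index of the square each point falls in: rows count south from the top
  # edge, columns count east from the left edge. pmin() keeps points that sit
  # exactly on the far edges inside the last row/column.
  row_idx <- pmin(floor((top_lat - pts$lat) / interval) + 1, num_intervals)
  col_idx <- pmin(floor((pts$lng - left_lng) / interval) + 1, n_cols)

  # Fixed factor levels make sure empty squares still get a cell in the result.
  sums <- tapply(
    pts$value,
    list(factor(row_idx, levels = seq_len(num_intervals)),
         factor(col_idx, levels = seq_len(n_cols))),
    sum
  )
  sums[is.na(sums)] <- 0     # empty squares sum to zero
  sums
}

grid_sums <- binSums(pts, num_intervals = 200)
dim(grid_sums)               # rows are latitude bands, columns are longitude steps
```

Because every point is touched once rather than once per square, the cost grows with the number of points instead of with points times squares.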

Related

double loop does not match

I'm having trouble with a loop that I'm trying to do.
I have data for monthly temperature (web-scraping from TerraClimate). I have to set the latitude and longitude, and my code gives me one observation for the average temperature of each month.
I have many pairs of coordinates that I need to web scrape so I'm doing a loop. The problem is that I'm trying to paste the name of the location and it's not working.
Here is an example:
I have a list with the coordinates
and an object with the names
loc1 <- c(-70.6666667, -33.4500000)
loc2 <- c(-71.6163889,-33.0458333)
c=list(loc1,loc2)
x=c("Loc 1", "Loc 2")
And my code looks like this:
library(ncdf4)      # nc_open(), ncvar_get()
library(data.table) # as.data.table(), :=
# web scraping temperature data
var <- "tmax"
terra <- paste0(paste0("http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_terraclimate_",var),"_1958_CurrentYear_GLOBE.nc")
nc <- nc_open(terra)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
# creating empty object to paste each set of data
DAT <- NULL
# loop that brings up the temperatures for each set of coordinates in c
for(i in c){
# names of locations
for(n in x){
flon = match(abs(lon -i[1]) < 1/48, 1)
lonindex = which(flon %in% 1)
flat = match(abs(lat -i[2]) < 1/48, 1)
latindex = which(flat %in% 1)
start <- c(lonindex, latindex, 1)
count <- c(1, 1, -1)
data <- as.numeric(ncvar_get(nc, varid = var,start = start, count))
data <- as.data.table(data)
data$date <- seq(as.Date("1958-02-01"), length.out=756, by="month")
data$ym_chr <- format(data$date, format = "%Y-%m")
data[,date:=NULL]
data$LOCATION <- as.character(c[n])
DAT <- rbind(DAT, data)
}
}
The problem is that the second part of the loop is not matching how I would like. Each location has 756 observations, so in total I should have 1512 obs. For the first time the loop runs, I would like "Loc 1" to be under LOCATION, and then, the second time the loop runs, for LOCATION to be "Loc 2".
But instead I'm getting 3024 observations.
This is a replicable code.
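The count of 3024 is consistent with the nested loops: the inner loop over x runs once per name for every coordinate pair, giving 2 × 2 × 756 rows. One possible fix is to walk the coordinates and the names in lockstep with a single index. The sketch below illustrates only the looping pattern; fetch_series is a hypothetical stand-in for the ncvar_get() call so that the example runs without network access, and the names locs and loc_names are assumptions.

```r
locs <- list(c(-70.6666667, -33.4500000), c(-71.6163889, -33.0458333))
loc_names <- c("Loc 1", "Loc 2")
n_obs <- 756

# Hypothetical stand-in for the TerraClimate download: returns n placeholder values.
fetch_series <- function(coord, n) rep(NA_real_, n)

pieces <- vector("list", length(locs))
for (k in seq_along(locs)) {           # one index drives both coordinates and names
  pieces[[k]] <- data.frame(
    data     = fetch_series(locs[[k]], n_obs),
    ym_chr   = format(seq(as.Date("1958-02-01"), length.out = n_obs, by = "month"),
                      "%Y-%m"),
    LOCATION = loc_names[k],
    stringsAsFactors = FALSE
  )
}
DAT <- do.call(rbind, pieces)          # bind once at the end instead of growing DAT
table(DAT$LOCATION)
```

Collecting the pieces in a list and binding once also avoids repeatedly copying DAT inside the loop.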

Looped raster extractions from a SpatialPointsDataFrame; each row needs to extract from a corresponding raster indicated in a column

I am attempting to make a loop that extracts values from NDVI rasters for multiple shapefiles. The shapefiles are animal GPS locations that include a date and a time. The shapefiles also include random locations that have been generated from the population's range, so I am assigning each random location that didn't have a date a date from one of the GPS locations, in a 5:1 ratio. I have a function that finds the closest date to the date for each GPS location or random point and stores it in the dataframe as "x$NDVIfile". The code all works up to this point, but I think the problem is here specifically:
for(j in length(nrow(x))){
a <- raster(paste0("E:/RSF_GIS/HabitatVariables/NDVI/",
x$NDVIfile[j], ".tif"))
x$ndvi[j] <- raster::extract(a, x[j,]) # extract each row based on the closest NDVI file
}
I want to extract, for each row in my data, from the raster in the directory that corresponds to "x$NDVIfile[j]". In my current output (dataframe x), all the extracted NDVI values are the same within each dataframe, and I suspect that the first raster referenced is the only raster being extracted from for each dataframe.
Here is the entire code :
RSF_dir <- list.files("E:/RSF_GIS/RSF_files",
pattern = "*.shp",
full.names = TRUE)
ndvi_dir = list.files("E:/RSF_GIS/HabitatVariables/NDVI",
pattern = "*.tif",
full.names = FALSE)
ndvi_dir.df <- tools::file_path_sans_ext(basename(ndvi_dir))
ndvi_dir.df <- as.Date(ndvi_dir.df)
ndvi_dir.df <- as.data.frame(ndvi_dir.df)
for (i in 1:length(RSF_dir)) {
x <- rgdal::readOGR(RSF_dir[i])
x <- as.data.frame(x)
nona <- x$Acqst_T[!is.na(x$Acqst_T)] # make a list of non- NA values
nona <- rep.int(nona, times = 6) # have the list repeat itself 6 times (5:1 = random_location:GPS_fix)
x$Acqst_T <- dplyr::coalesce(x$Acqst_T, nona) # and assign to NAs
x$date <- as.Date(x$Acqst_T, format = '%Y-%m-%d %H:%M:%S')
min_distances <- as.numeric(x$date)- matrix(rep(as.numeric(ndvi_dir.df$ndvi_dir),nrow(x)),ncol=length(ndvi_dir.df$ndvi_dir),byrow=T)
min_distances <- as.data.frame(t(min_distances))
closest <- sapply(min_distances,function(o) { # function to find the closest NDVI date for each GPS fix
w <- which(o==min(o[o>0])); # (MOD09Q1 collects imagery every 8 days)
ifelse(length(w)==0,NA,w)
})
x$NDVIfile <- as.Date(ndvi_dir.df$ndvi_dir[closest])
x <- SpatialPointsDataFrame(data.frame(x$coords.x1, x$coords.x2), x, proj4string=veg_INREV@proj4string)
for(j in length(nrow(x))){
a <- raster(paste0("E:/RSF_GIS/HabitatVariables/NDVI/",
x$NDVIfile[j], ".tif"))
x$ndvi[j] <- raster::extract(a, x[j,]) # extract each row based on the closest NDVI file
}
writeOGR(obj= x, dsn="E:/RSF_GIS/RSF_files/trial",
layer=(paste0(tools::file_path_sans_ext(basename(RSF_dir[i])))), driver="ESRI Shapefile", overwrite_layer = TRUE)
}
Thanks!
Answering my own question, the solution was simply a change in code indicating the iterations of the loop:
for(j in length(nrow(x))){
...
}
should have been instead
for(j in 1:nrow(x)){
...
}
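To see why the original loop produced identical values, note that nrow(x) is a single number, so length(nrow(x)) is always 1 and the loop body runs exactly once with j equal to 1. The following small sketch uses a plain data frame rather than a SpatialPointsDataFrame, and the value 0.42 is a made-up placeholder for an extracted NDVI value.

```r
x <- data.frame(id = 1:5)

length(nrow(x))                      # nrow(x) is one number, so its length is 1

visited <- integer(0)
for (j in length(nrow(x))) visited <- c(visited, j)
visited                              # the loop ran once, with j equal to 1

# Assigning into element 1 of a column that does not exist yet creates a
# length-one vector, which the data frame then recycles to every row.
x$ndvi[1] <- 0.42
x$ndvi

# seq_len(nrow(x)) visits every row and also copes with zero-row data.
visited_all <- integer(0)
for (j in seq_len(nrow(x))) visited_all <- c(visited_all, j)
visited_all
```

seq_len(nrow(x)) behaves like 1:nrow(x) for non-empty data, but yields an empty sequence when there are no rows, whereas 1:0 would iterate over 1 and 0.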

average gridded climate data for duplicated times in r

I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
the dimensions of the data are describing the variable at different longitude (dim = 1), latitude (dim = 2), and time (dim = 3).
The issue I have at the moment is that some of the times are repeated, something to do with overlapping sensors measuring the data. Therefore, I was wondering if it was possible to only keep the unique times for dat, but average the data within the grid for the duplicated times i.e. if there are two repeated days we take the average value in each latitude and longitude grid for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- unique(time)
The following code then aims to loop through each grid (lat/lon) and average all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon),length(lat),length(new_time)))
for(i in 1:length(lon)){
for(ii in 1:length(lat)){
dat2 <- dat[i,ii,]
dat2b <- NA
for(k in 1:length(unique_time)){
idx <- time == unique_time[k]
dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
}
new_dat[i,ii,] <- dat2b
}
}
I'm convinced that this provides the correct answer, but I'm certain there is a much cleaner method to achieve this.
I should also note that my data is quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient by use of aggregate:
for(i in 1:length(lon)){
for(ii in 1:length(lat)){
new_dat[i,ii,] <- as.numeric(aggregate(dat[i,ii,], by=list(time),mean)$x)
}
}
It still has 2 out of the 3 of the loops, but it manages to bypass creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii){as.numeric(aggregate(dat[i,ii,], by=list(time),mean)$x)}
grid <- expand.grid(1:length(lon),1:length(lat)) # build the lon/lat index grid once
for(i in 1:nrow(grid)){
new_dat[grid[i,1], grid[i,2],] <- f(grid[i,1], grid[i,2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
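As a rough sketch of how the remaining loop might be removed, the array can be flattened to a matrix with one row per grid cell and one column per time step, after which base R's rowsum() adds up all duplicated times in a single call. The helper name avg_by_time and the toy data are assumptions for illustration, and unlike the mean(..., na.rm = TRUE) in the question this version assumes there are no NA values.

```r
# Hypothetical helper: average a lon x lat x time array over duplicated times.
avg_by_time <- function(dat, time) {
  key   <- as.numeric(time)
  utime <- sort(unique(key))
  grp   <- match(key, utime)                    # group id of every time step
  d     <- dim(dat)
  m     <- matrix(dat, nrow = d[1] * d[2])      # rows: grid cells, cols: time steps
  sums   <- rowsum(t(m), grp)                   # rows: unique times, cols: grid cells
  counts <- tabulate(grp, nbins = length(utime))
  means  <- sums / counts                       # row i is divided by counts[i]
  array(t(means), dim = c(d[1], d[2], length(utime)))
}

# Toy data: 2 longitudes, 3 latitudes, 5 time steps, where days 2 and 3 occur twice.
time <- as.POSIXct(c(1, 2, 3, 2, 3) * 86400, origin = "1970-01-01", tz = "UTC")
dat  <- array(as.numeric(seq_len(2 * 3 * 5)), dim = c(2, 3, 5))

new_dat <- avg_by_time(dat, time)
dim(new_dat)
```

The third dimension of the result follows the sorted unique times, which matches the ordering that aggregate() produces.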

Counting spatialpoints in gridcells

SO gurus!
I am trying to count the densities of surviving units in different gridcells.
I have two shapefiles with points from the two survey periods in question (one before and one after the mortality event). What I intend is to see whether there is a difference in survival rates and link the proportion of survival to any climatic variable obtained from the raster value of the desired grid. In the code snippet below I have created some random raster and shapefiles.
packs = c('raster', 'rgdal', 'spatstat', 'sp' ,'dplyr')
sapply(packs, FUN = 'require', character.only = TRUE)
xy <- matrix(rnorm(1024),32,32) #Creating the desired raster
image(xy)
rast <- raster(xy)
extent(rast) <- c(36,37,-3,-2)
projection(rast) <- CRS("+proj=longlat +datum=WGS84")
points <- runifpoint(n =4000, c(36,37,-3,-2)) # Creating the points
x <- points$x
y <- points$y
values <- c(rep(1, 900), rep(0, 3100))
xy <- cbind(x, y)
points <- cbind(x, y, values)
points <- data.frame(points)
shp <- SpatialPointsDataFrame(coords = xy, data = data.frame(values) ) # creating shpfiles
projection(shp) <- CRS("+proj=longlat +datum=WGS84")
subs <- filter(points, values == 1)
suxy <- select(subs, x,y)
shpsub <- SpatialPointsDataFrame(coords = suxy, data = data.frame(subs$values)) # creating shpfiles
projection(shpsub) <- CRS("+proj=longlat +datum=WGS84")
When I attempt to extract the points I use the following lines of code
shp <- spTransform(shp, projection(rast)) # make sure they have same transformation
shpsub <- spTransform(shpsub, projection(rast))
XY <- xyFromCell(rast, cell = 1:ncell(rast))
v <- as.data.frame(rast) #Extract values from raster
XY <- data.frame(XY, v) # Creating a data frame containing coord., cellno and value
XY$cell <- c(1:ncell(rast))
cells <- cellFromXY(rast,shp) # find which cells the points are in
cells <- rle(cells) # returns a value and a length, fast for counting
cellsfound <- cellFromXY(rast,shpsub)
cellsfound <- rle(cellsfound)
Proportion <- data.frame(cell = cells$values, shp = cells$lengths)
test <- data.frame(cell = rep(NA,NROW(Proportion)), shpsub = rep(NA, NROW(Proportion)))
test$cell <- c(cellsfound$values, rep(NA, nrow(test) - length(cellsfound$values)))
test$shpsub <- c(cellsfound$lengths, rep(NA, NROW(test) - length(cellsfound$lengths)))
Proportion <- full_join(Proportion, test, by = "cell")
test.Proportion <- mutate(Proportion, Proportion = shpsub/shp) #Calculating Proportion
XY <- left_join(XY, test.Proportion, by = "cell") # Adding Proportion to coord and cell no.
XY.m <- summarise(XY, )
XY <- na.omit(XY) ; XY <- XY[,-4]
As I see it, using rle() returns the same cells multiple times instead of counting the number of points within each individual cell, as was my intention. Can anyone please explain to me how to do this in a way that retrieves the number of occurrences in the individual cells?
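The behaviour described here follows from what rle() does: it counts runs of consecutive equal values, so a cell number that reappears later in the vector starts a new run. table() counts every occurrence regardless of order. A minimal sketch on a made-up vector of cell numbers, standing in for the output of cellFromXY():

```r
# Made-up cell numbers standing in for cellFromXY(rast, shp)
cells <- c(3, 3, 7, 3, 7, 7, 1)

runs <- rle(cells)
runs$values     # cell 3 and cell 7 each appear more than once
runs$lengths    # lengths of consecutive runs, not totals per cell

counts <- table(cells)                       # total occurrences per cell
counts_df <- data.frame(
  cell = as.integer(names(counts)),
  n    = as.vector(counts)
)
counts_df
```

Sorting first, as in rle(sort(cells)), would give the same totals as table(), because all equal values then form a single run.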

Calculate the distance between two points of two datasets (nearest neighbor)

I want to calculate the distance between two points in two different datasets. I don't want to calculate the distance between all points - just to the nearest point of datasetB.
Some examples:
Dataset A - Persons
http://pastebin.com/HbaeqACi
Dataset B - Waterfeatures:
http://pastebin.com/UdDvNtHs
Dataset C - City:
http://pastebin.com/nATnkMRk
So...I want to calculate the distance of each person to the nearest waterfeature point.
I've already tried the rgeos package and, after struggling with some projection errors, got it to work. But it calculates (at least I assume so) the distances to every point, whereas, as already said, I'm only interested in the distance to the nearest waterfeature point.
# load csv files
persons = read.csv("persons.csv", header = TRUE)
water = read.csv("water.csv", header = TRUE)
# change dataframes to SpatialPointDataFrame and assign a projection
library(sp)
library(rgeos)
coordinates(persons) <- c("POINT_X", "POINT_Y")
proj4string(persons) <- CRS("+proj=utm +datum=WGS84")
coordinates(water) <- c("POINT_X", "POINT_Y")
proj4string(water) <- CRS("+proj=utm +datum=WGS84")
# use rgeos package to calculate the distance
distance <- gDistance(persons, water, byid=TRUE)
# works, but calculates a huge number of distances
Is there a parameter I've missed, or do I need to use another package or function? I've also looked at spatstat, which is able to calculate the distance to the nearest neighbor, but not between two different datasets: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/spatstat/html/nndist.html
Edit:
The complete R-Script including plotting of the datasets:
library(RgoogleMaps)
library(ggplot2)
library(ggmap)
library(sp)
library(fossil)
library(plyr) # provides ldply(), used below
#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("city.csv", header =TRUE)
# plot data
persons_ggplot2 <- persons
city_ggplot2 <- city
water_ggplot2 <- water
gc <- geocode('new york, usa')
center <- as.numeric(gc)
G <- ggmap(get_googlemap(center = center, color = 'bw', scale = 1, zoom = 11, maptype = "terrain", frame=T), extent="device")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons, shape = 8, color="red", size=2.5) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=water_ggplot2, color="blue", size=1)
plot(G1)
#### calculate distance
# Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
# Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
WaterFeatures <- rbind(tmp,water[,2:3])
tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
tmp1 <- which.min(tmp1)
tmp1 <- water[tmp1,1]
tmp1 <- data.frame(tmp1, WaterFeature=tmp)
return(tmp1)
}
#apply to each id and the merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))
What about writing a function that looks for the nearest waterfeature for every person?
#requires function earth.dist from "fossil" package and ldply from "plyr" package
require(fossil)
require(plyr)
#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
#Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
#Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
WaterFeatures <- rbind(tmp,water[,2:3])
tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
tmp1 <- min(tmp1)
tmp1 <- data.frame(tmp1, WaterFeature=tmp)
return(tmp1)
}
#apply to each id and the merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))
NOTE: I've added a stringsAsFactors parameter to the original read.csv; it makes the merging easier at the end
NOTE: Column tmp1 holds the distance to the nearest water feature (earth.dist reports distances in kilometers)
Maybe I'm a little too late, but you can use spatstat to compute distances between two different datasets. The command is nncross. The arguments you have to use are two objects of type ppp, which you can create using the as.ppp() function.
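For readers without spatstat installed, the idea of a nearest neighbour between two different point sets can be sketched in base R. This is a brute-force illustration, not spatstat's implementation: the function name nearest_dist and the toy coordinates are assumptions, and it uses plain Euclidean distance, so it is only meaningful for projected (planar) coordinates, not raw longitude/latitude.

```r
# Hypothetical helper: for each row of `a`, distance to and index of the
# nearest row of `b`. Both are two-column numeric matrices of planar x/y.
nearest_dist <- function(a, b) {
  out <- lapply(seq_len(nrow(a)), function(i) {
    d2 <- (b[, 1] - a[i, 1])^2 + (b[, 2] - a[i, 2])^2   # squared distances to all of b
    k  <- which.min(d2)
    data.frame(dist = sqrt(d2[k]), which = k)
  })
  do.call(rbind, out)
}

# Toy planar coordinates standing in for persons and water features
persons_xy <- cbind(x = c(0, 10), y = c(0, 0))
water_xy   <- cbind(x = c(3, 10, 0), y = c(4, 1, 100))

res <- nearest_dist(persons_xy, water_xy)
res
```

Only one vector of distances is held in memory per person, instead of the full persons × water matrix that a call computing all pairwise distances returns.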
