Is there any way I can optimize this R code?

This is code I'm trying to run in RStudio. I know the number of iterations is far too large. Is there a faster or more optimal way to do this? It has been running for 4+ hours and doesn't look like it will finish any time soon.
I'm trying to build a distance matrix between 415 cities and 3,680,126 monuments. To optimize, I only compare monuments with cities that are in the same country.
for (x in 1:3680126) {
  for (y in 1:415) {
    if (list2_cities$Country[y] == list1_POI$Country[x]) {
      distance_matrix[x, y] <- (list1_POI$Longitude[x] - list2_cities$Longitude[y])^2 +
                               (list1_POI$Latitude[x] - list2_cities$Latitude[y])^2
    } else {
      distance_matrix[x, y] <- 0
    }
  }
}

Maybe you can try distm from the geosphere package:
library(geosphere)
d <- distm(list1_POI[c("Longitude","Latitude")],list2_cities[c("Longitude","Latitude")])
m <- +(outer(list1_POI$Country,list2_cities$Country,`==`))
res <- d*m
where
the distm part gives all pairwise distances between the POIs and the cities
the outer part provides a 0/1 mask so that entries for non-matching countries are set to 0
If your desired matrix is sparse, here is another option
common <- intersect(list1_POI$Country, list2_cities$Country)
rl <- match(common, list1_POI$Country)
cl <- match(common, list2_cities$Country)
d <- diag(distm(list1_POI[rl, c("Longitude","Latitude")], list2_cities[cl, c("Longitude","Latitude")]))
res <- matrix(0, length(list1_POI$Country), length(list2_cities$Country))
res[cbind(rl, cl)] <- d
where you only need to locate the matched cities and calculate their distances.
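A country-by-country version of the same idea is sketched below (my own sketch, not part of either approach above). It assumes list1_POI and list2_cities both have Country, Longitude and Latitude columns; distances are computed only within blocks of matching countries and everything else stays 0. Note that the full 3680126 x 415 result matrix is itself large (roughly 12 GB of doubles).
library(geosphere)
res <- matrix(0, nrow(list1_POI), nrow(list2_cities))
for (cc in intersect(unique(list1_POI$Country), unique(list2_cities$Country))) {
  i <- which(list1_POI$Country == cc)     # all POIs in this country
  j <- which(list2_cities$Country == cc)  # all cities in this country
  res[i, j] <- distm(list1_POI[i, c("Longitude","Latitude")],
                     list2_cities[j, c("Longitude","Latitude")])
}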

Related

foreach doesn't change value of raster cell in R

I'm trying to simulate herding behavior in R.
Here's the code
library(raster)
library(sp)
library(foreach)
K <- 100
sig <- 0.2
G <- 0.3
x <- raster(ncol=2000, nrow=2000)
values(x) <- sign(rnorm(4000000, mean=0, sd=0.3))
y <- raster(ncol=2000, nrow=2000)
values(y) <- sign(rnorm(4000000, mean=0, sd=0.3))
#plot(x)
ei <- rnorm(4000000)
j <- 0
while (j < 30) {
  for (i in 1:4000000) {
    ad <- adjacent(x, cells=i)[,2]
    y[i] <- sign(K*sum(x[ad]) + sig*ei[i] + G)
  }
  x <- y
  plot(x)
  j <- j + 1
}
The classic loop approach is too slow.
If I use a foreach loop instead of a classic for loop, the values of y do not get updated in each iteration, and I can't fix it.
Can someone please help with this?
Thank you.
You have a dynamic model in which the output of each (time) step is the input for the next step, so the steps cannot be run in parallel. But that does not mean you cannot make the model run faster.
Looping over raster cells in R is always going to be slow, so we need to avoid that. Normally a problem like this could be solved with focal (see the code at the bottom), but in this case that is difficult because you effectively use two rasters (x and ei). I will look at implementing multi-layer focal operations in the terra package.
Here is an approach with getValuesFocal. It is much faster (and I use Sys.sleep to slow it down a bit).
library(raster)
set.seed(0)
x <- raster(ncol=200, nrow=200)
values(x) <- sign(rnorm(ncell(x), mean=0, sd=0.3))
y <- raster(x)
values(y) <- sign(rnorm(ncell(x), mean=0, sd=0.3))
ei <- rnorm(ncell(x))
K <- 100
sig <- 0.2
G <- 0.3
for (j in 1:29) {
  # with large rasters, you may need to do the below in chunks
  v <- getValuesFocal(x, 1, nrow(x), c(3,3))
  # only keep the rook neighbors
  v <- v[, c(2,4,6,8)]
  v <- rowSums(v, na.rm=TRUE)
  values(x) <- sign(K*v + sig*ei + G)
  plot(x)
  Sys.sleep(0.1)
}
This is how you could use focal in similar cases:
w <- matrix(c(0,1,0,1,0,1,0,1,0), 3, 3)
y <- focal(x, w, fun=function(i)sign(K*sum(i)+sig+G))
Also see the cellular automata examples in ?focal
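For reference, here is a minimal cellular-automaton style sketch with focal (my own illustration along the lines of the ?focal examples, assuming a simple majority rule over the four rook neighbours):
library(raster)
r <- raster(ncol=100, nrow=100)
values(r) <- sign(rnorm(ncell(r)))
w <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3)           # rook neighbours, centre excluded
for (step in 1:10) {
  nb <- focal(r, w, fun=sum, pad=TRUE, padValue=0)  # neighbourhood sum
  values(r) <- ifelse(values(nb) >= 0, 1, -1)       # each cell follows the local majority
}
plot(r)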

raster calculation with condition of each cell by layers in R

I have a raster stack with several layers. For each cell I want to sum the values over a selection of layers (the selection differs per cell) and generate a new layer. Does anyone have a good suggestion using calc, overlay, or some other raster calculation in R?
I can do it with loops, but that takes a long time when there are many layers, and it also uses a lot of memory. My script is as follows:
library(raster)
make_calc <- function(rr, start, end) {
  rr <- as.array(rr)
  start <- as.array(start)
  end <- as.array(end)
  dms <- dim(rr)
  tmp <- array(NA, dim = dms[1:2])
  for (i in 1:dms[1]) {
    for (j in 1:dms[2]) {
      tmp[i,j] <- sum(rr[i, j, start[i,j,1]:end[i,j,1]], na.rm = TRUE)
    }
  }
  return(tmp)
}
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
start <- raster(res = 10)
start[] <- sample(1:2, ncell(start), replace = TRUE)
end <- raster(res = 10)
end[] <- sample(3:4, ncell(end), replace = TRUE)
result <- make_calc(rr, start, end)
Why are you coercing into arrays? You can easily collapse a raster into a vector, but that does not even seem necessary here. In the future, please try to be more clear about what your expected outcome is.
Based on your code, I really don't know what you are getting at, so I am going to take a few guesses: summing specified rasters in the stack, drawing a random sample across the rasters to be summed, and finally drawing a random sample of cells to be summed.
For a sum on specified rasters in a stack, you can just index what you are after in the stack using a double bracket. In this case, rasters 1 and 3 in the stack would be the only ones summed.
library(raster)
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
( sum_1_3 <- calc(rr[[c(1,3)]], sum) )
If you want a random sample of the values across rasters, for every cell, you could write a function that is passed to calc. Here is an example that takes a random sample of size n across the raster layers' values and sums them.
rs.sum <- function(x, n=2) {sum( x[sample(1:length(x),n)], na.rm=TRUE)}
rs.sum.raster <- calc(rr, rs.sum)
If you want to apply a function to a limited random selection of cells, you could create a random sample of the raster to use as an index. Here we create a random sample of cells, create an empty raster, and fill in the sum of rasters 1 and 2 (in the stack) at the sampled cell index. A raster in the stack is indexed with double brackets and its values with single brackets, so for raster 1 in the stack, limited to the values in the random sample, you would use rr[[1]][rs].
rs <- sample(1:ncell(rr[[1]]), 300)
r.sum <- rr[[1]]
r.sum[] <- NA
r.sum[rs] <- rr[[1]][rs] + rr[[2]][rs]
plot(r.sum)
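If the intent of the original loop is the per-cell sum over the layers start:end, here is a hedged, vectorized sketch that avoids coercing to arrays (it assumes start and end hold valid layer indices for every cell):
res <- rr[[1]] * 0                       # zero raster with the same geometry as the stack
for (l in 1:nlayers(rr)) {
  inside <- (start <= l) & (l <= end)    # 1 where layer l falls inside the cell's range
  res <- res + rr[[l]] * inside
}
plot(res)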

R focal (raster package): how to apply filter to subsets of background data?

I was wondering if any of you could help me with the following task dealing with the focal() function in the R raster package.
By default, the focal() function will run through each cell of a given raster ('background' raster hereafter) and apply a given function to the neighboring values as defined by a user-defined moving window. In order to optimize and speed up my computation on large rasters/study areas, I would like to apply this function (filter) only when the 'background' raster has some values (e.g. greater than zero) within the extent covered by the 'moving window' and skip all the other focal cells. This way, the filter would not spend time computing any focal value where there is no need to.
Below a reproducible small example and in-line comments:
library(raster)
x <- matrix(1:25, ncol=5)
x[c(1,2,3,6,7,8,11,12,13)] <- 0
r <- raster(x)
#Apply filter to focal cells and return values using a 3x3 moving window...ONLY IF
#ALL values found within the window are > 0. Skip focal cell otherwise.
r3 <- focal(r, w=matrix(1/9,nrow=3,ncol=3), FUN=sum)
How should I change this function to have the desired outcome?
The window slides over all focal pixel locations; skipping/jumping locations is not possible. However, you can check whether all of the cells in the window satisfy your threshold condition, as below:
threshold <- 0   # cells must be greater than this value
myfunc <- function(x) {
  if (isTRUE(all(x > threshold))) {   # isTRUE() also handles windows containing NA
    sum(x)
  } else {
    0
  }
}
r3 <- focal(r, w=matrix(1, nrow=3, ncol=3), fun=myfunc)
I'm not sure it would be any faster, but you could also check if the center cell adheres to some criteria (e.g. is NA or >0). This way the focal calculation will only run when it meets those criteria.
w <- matrix(1, 5, 5)
skipNA_avgFunc <- function(x) {
  # general definition of center cell for weight matrices with odd dims
  center <- x[ceiling(length(x)/2)]
  if (is.na(center)) {        # handle NA values
    return(center)
  } else if (center >= 0) {   # calculate mean
    mean(x, na.rm=TRUE)
  } else {                    # otherwise return NA
    NA
  }
}
r3 <- focal(r, w=w, fun=skipNA_avgFunc, pad=TRUE, padValue=NA)
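Another option (my own sketch, not part of either answer above) is to post-mask rather than skip: in line with the first answer, no focal cells are skipped, but the "all window values > 0" result is obtained with two cheap focal passes:
ok <- focal(r > 0, w=matrix(1, 3, 3), fun=sum)   # count of window cells that are > 0
r3 <- focal(r, w=matrix(1/9, 3, 3), fun=sum)     # ordinary 3x3 mean filter
r3 <- mask(r3, ok == 9, maskvalue=0)             # keep values only where all 9 cells pass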

Choose n most evenly spread points across point dataset in R

Given a set of points, I am trying to select a subset of n points that are most evenly distributed across this set of points. In other words, I am trying to thin out the dataset while still evenly sampling across space.
So far, I have the following, but this approach likely won't do well with larger datasets. Maybe there is a more intelligent way to choose the subset of points in the first place...
The following code randomly chooses a subset of the points, and seeks to minimize the distance between the points within this subset and the points outside of this subset.
Suggestions appreciated!
evenSubset <- function(xy, n) {
  bestdist <- NA
  bestSet <- NA
  alldist <- as.matrix(dist(xy))
  diag(alldist) <- NA
  alldist[upper.tri(alldist)] <- NA
  for (i in 1:1000) {
    subset <- sample(1:nrow(xy), n)
    subdists <- alldist[subset, -subset]
    distsum <- sum(subdists, na.rm=TRUE)
    if (distsum < bestdist | is.na(bestdist)) {
      bestdist <- distsum
      bestSet <- subset
    }
  }
  return(xy[bestSet,])
}
xy <- cbind(rnorm(1000), rnorm(1000))
xy2 <- evenSubset(xy, n=20)
plot(xy)
points(xy2, col='blue', cex=1.5, pch=20)
Following @Spacedman's suggestion, I used Voronoi tessellation to identify and drop the points that were closest to other points.
Here, the percentage of points to drop is passed to the function. This appears to work quite well, except that it is slow with large datasets.
library(tripack)
voronoiFilter <- function(occ, drop) {
  n <- round(nrow(occ) * drop, digits=0)
  subset <- occ
  dropped <- vector()
  for (i in 1:n) {
    v <- voronoi.mosaic(x=subset[,'Longitude'], y=subset[,'Latitude'], duplicate='error')
    info <- cells(v)
    areas <- unlist(lapply(info, function(x) x$area))
    smallest <- which(areas == min(areas, na.rm=TRUE))
    dropped <- c(dropped, which(paste(occ[,'Longitude'], occ[,'Latitude'], sep='_') ==
                                paste(subset[smallest,'Longitude'], subset[smallest,'Latitude'], sep='_')))
    subset <- subset[-smallest,]
  }
  return(occ[-dropped,])
}
xy <- cbind(rnorm(500),rnorm(500))
colnames(xy) <- c('Longitude','Latitude')
xy2 <- voronoiFilter(xy, drop=0.7)
plot(xy)
points(xy2,col='blue',cex=1.5,pch=20)
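For comparison, here is a hedged sketch of the same "drop the most crowded point" idea without building Voronoi cells, assuming "crowded" is measured by nearest-neighbour distance (the nnFilter name and its drop argument are my own, hypothetical choices). It holds a full distance matrix, so it only suits moderate numbers of points:
nnFilter <- function(xy, drop) {
  n_drop <- round(nrow(xy) * drop)
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  for (i in 1:n_drop) {
    nn <- apply(d, 1, min)        # nearest-neighbour distance of each remaining point
    worst <- which.min(nn)        # the most crowded point
    d <- d[-worst, -worst, drop=FALSE]
    xy <- xy[-worst, , drop=FALSE]
  }
  xy
}
xy2 <- nnFilter(xy, drop=0.7)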

R: pairwise Euclidean distance between columns of two matrices

The following loop takes too long to run (about 2 minutes per iteration).
tumor_signals is 950000 x 422 and normal_signals is 950000 x 772.
Any ideas for how to speed it up?
for (i in 1:ncol(tumor_signals)) {
  x <- as.vector(tumor_signals[,i])
  print("Assigned x")
  y <- t((t(normal_signals) - x)^2)
  print("assigned y")
  y <- t(sqrt(colSums(y)))
  print("done")
  #all_distance <- cbind(all_distance, matrix(distance))
  print(i)
}
There's a bug in your code: you don't need to take the transpose of normal_signals. As I understand it, you are trying to compute, for all i = 1, 2, ..., 422 and j = 1, 2, ..., 772, the Euclidean distance between tumor_signals[,i] and normal_signals[,j], and you would probably want the results in a 422 x 772 matrix. The function rdist() in the fields package will do this for you:
require(fields)
result <- rdist(t(tumor_signals), t(normal_signals))
Incidentally, a Google search for [R Euclidean distance] would have easily found this package.
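For a base-R alternative (my own sketch, using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b), the full 422 x 772 distance matrix comes from a single matrix product:
sq_t <- colSums(tumor_signals^2)                   # squared norms of the 422 tumor columns
sq_n <- colSums(normal_signals^2)                  # squared norms of the 772 normal columns
cross <- crossprod(tumor_signals, normal_signals)  # t(tumor_signals) %*% normal_signals, 422 x 772
d2 <- outer(sq_t, sq_n, "+") - 2 * cross
d <- sqrt(pmax(d2, 0))                             # clamp tiny negative values from rounding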
