Choose n most evenly spread points across a point dataset in R

Given a set of points, I am trying to select a subset of n points that are most evenly distributed across this set of points. In other words, I am trying to thin out the dataset while still evenly sampling across space.
So far, I have the following, but this approach likely won't do well with larger datasets. Maybe there is a more intelligent way to choose the subset of points in the first place...
The following code randomly chooses a subset of the points, and seeks to minimize the distance between the points within this subset and the points outside of this subset.
Suggestions appreciated!
evenSubset <- function(xy, n) {
  bestdist <- NA
  bestSet <- NA
  alldist <- as.matrix(dist(xy))
  diag(alldist) <- NA
  alldist[upper.tri(alldist)] <- NA
  for (i in 1:1000) {
    # random candidate subset; keep the one with the smallest summed
    # distance between subset and non-subset points
    subset <- sample(1:nrow(xy), n)
    subdists <- alldist[subset, -subset]
    distsum <- sum(subdists, na.rm = TRUE)
    if (is.na(bestdist) || distsum < bestdist) {
      bestdist <- distsum
      bestSet <- subset
    }
  }
  return(xy[bestSet, ])
}
xy <- cbind(rnorm(1000), rnorm(1000))
xy2 <- evenSubset(xy, n = 20)
plot(xy)
points(xy2, col = 'blue', cex = 1.5, pch = 20)

Following @Spacedman's suggestion, I used Voronoi tessellation to identify and drop the points with the smallest Voronoi cells, i.e. those packed most closely together.
Here, the fraction of points to drop is passed to the function. This appears to work quite well, except that it is slow with large datasets, since the tessellation is recomputed after every dropped point.
library(tripack)
voronoiFilter <- function(occ, drop) {
  n <- round(nrow(occ) * drop)
  subset <- occ
  dropped <- vector()
  for (i in 1:n) {
    v <- voronoi.mosaic(x = subset[,'Longitude'], y = subset[,'Latitude'],
                        duplicate = 'error')
    info <- cells(v)
    areas <- unlist(lapply(info, function(x) x$area))
    smallest <- which(areas == min(areas, na.rm = TRUE))
    dropped <- c(dropped,
                 which(paste(occ[,'Longitude'], occ[,'Latitude'], sep = '_') ==
                       paste(subset[smallest,'Longitude'],
                             subset[smallest,'Latitude'], sep = '_')))
    subset <- subset[-smallest,]
  }
  return(occ[-dropped,])
}
xy <- cbind(rnorm(500),rnorm(500))
colnames(xy) <- c('Longitude','Latitude')
xy2 <- voronoiFilter(xy, drop=0.7)
plot(xy)
points(xy2,col='blue',cex=1.5,pch=20)
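For larger datasets, one possible alternative (a sketch only, not part of either approach above) is to cluster the points into n groups with kmeans and keep the point nearest each cluster centre; this avoids both the full distance matrix and the repeated tessellations.
kmeansThin <- function(xy, n) {
  km <- kmeans(xy, centers = n, iter.max = 100)
  # for each cluster, keep the member closest to the cluster centre
  idx <- sapply(seq_len(n), function(k) {
    members <- which(km$cluster == k)
    d <- rowSums(sweep(xy[members, , drop = FALSE], 2, km$centers[k, ])^2)
    members[which.min(d)]
  })
  xy[idx, ]
}
xy2 <- kmeansThin(xy, 20)
points(xy2, col = 'red', cex = 1.5, pch = 20)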


R terra calculate area moment of inertia OR how to get (weighted) raster-cell distance from patch-centroid

I'm trying to calculate a measure akin to the moment of inertia using a raster layer, and I am struggling to figure out how to get the distance of each cell to a patch's centroid and then extract both that distance and the cell's value.
I want to calculate the moment of inertia (get the squared distance of each cell to its patch's centroid, multiply by the value of the cell, sum these values by patch, and then divide by the sum of all values per patch). I provide a simplified set-up below. The code creates a simple raster layer, patches clusters of cells, and gets their centroids. I know that the function to use next is probably terra::distance (maybe in combination with terra::zonal?!) -- how do I calculate the distance by patch?
#lonlat
library(terra)
r <- rast(ncols=36, nrows=18, crs="+proj=longlat +datum=WGS84")
r[498:500] <- 1
r[3:6] <- 1
r[111:116] <- 8
r[388:342] <- 1
r[345:349] <- 3
r_patched <- patches(r, directions = 8, allowGaps = F)
testvector <- terra::as.polygons(r_patched, trunc=T, dissolve = T)
p_centr <- geom(centroids(testvector), df=T)
##next steps
#1. get distance of each cell from patch's centroid
#r <- distance(r)
#2. multiply cell value by squared distance to centroid
I think you need to loop over the patches. Something like this:
p_centr <- centroids(testvector)
v <- rep(NA, length(p_centr))
for (i in 1:length(p_centr)) {
  x <- ifel(r_patched == p_centr$patches[i], i, NA)
  x <- trim(x)
  d <- distance(x, p_centr[i,])
  d <- mask(d, x)
  # square distance and multiply with cell values
  d <- d^2 * crop(r, d)
  v[i] <- global(d, "sum", na.rm=TRUE)[[1]]
}
v / sum(v)
#[1] 1.213209e-05 1.324495e-02 9.864759e-01 2.669833e-04
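If you want each patch's sum divided by that patch's total cell value instead (as described in the question), a hedged sketch along the same lines, reusing r, r_patched and p_centr from above, would be:
w <- rep(NA, length(p_centr))
for (i in 1:length(p_centr)) {
  m <- ifel(r_patched == p_centr$patches[i], 1, NA)      # mask for patch i
  w[i] <- global(mask(r, m), "sum", na.rm = TRUE)[[1]]   # total cell value in patch i
}
v / w
# terra::zonal(r, r_patched, "sum", na.rm = TRUE) could give the same per-patch sums more compactly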

Is there any way I can optimize this R code?

This is code I'm trying to run in RStudio. I know there are far too many iterations. Is there a faster or more optimal way to do this? It has been running for 4+ hours and doesn't look like it will finish any time soon.
I'm trying to build a distance matrix between 415 cities and 3,680,126 monuments. To reduce the work, I only compare monuments with cities that are in the same country.
for (x in 1:3680126) {
  for (y in 1:415) {
    if (list2_cities$Country[y] == list1_POI$Country[x]) {
      distance_matrix[x,y] <- (POI$Longitude[x] - cities$Longitude[y])^2 +
                              (POI$Latitude[x] - cities$Latitude[y])^2
    } else {
      distance_matrix[x,y] <- 0
    }
  }
}
Maybe you can try distm from the geosphere package:
library(geosphere)
d <- distm(list1_POI[c("Longitude","Latitude")],list2_cities[c("Longitude","Latitude")])
m <- +(outer(list1_POI$Country,list2_cities$Country,`==`))
res <- d*m
where
the distm part gives all pairwise distances between the two sets of coordinates
the outer part provides a 0/1 mask so that values for non-matching countries are set to 0
If your desired matrix is sparse, here is another option:
common <- intersect(list1_POI$Country,list2_cities$Country)
rl <- match(common,list1_POI$Country)
cl <- match(common,list2_cities$Country)
d <- diag(distm(list1_POI[rl,c("Longitude","Latitude")],list2_cities[cl,c("Longitude","Latitude")]))
res <- matrix(0,length(list1_POI$Country),length(list2_cities$Country))
res[cbind(rl,cl)] <- d
where you only need to locate the matched cities and calculate their distances.
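If building the full distm matrix in one call is too memory-hungry, a hedged sketch of a block-wise variant (assuming both tables are data frames with Longitude, Latitude and Country columns) is to loop over the countries and fill only the matching blocks:
library(geosphere)
res <- matrix(0, nrow(list1_POI), nrow(list2_cities))
for (ctry in intersect(list1_POI$Country, list2_cities$Country)) {
  ri <- which(list1_POI$Country == ctry)
  ci <- which(list2_cities$Country == ctry)
  res[ri, ci] <- distm(list1_POI[ri, c("Longitude", "Latitude")],
                       list2_cities[ci, c("Longitude", "Latitude")])
}
The result matrix itself is still large (3,680,126 x 415), but only the distances for same-country pairs are ever computed.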

R function for creating discs around each point in a pattern, then counting number of points in each disc [spatial]

I am attempting to create a disc for each point in a pattern; each disc will have the same radius. Then for each disc, I want to count the number of points falling within the disc. Each pattern has 100-400 points. I have written code to do this, but it is quite slow. The code is below. I cannot provide the shapefile and points as that would be very difficult, but I could create some dummy data if need be.
W <- as.owin(shape)
#Converts created .shp file into a "window"
#in which everything is plotted and calculated
SPDF <- SpatialPointsDataFrame(P[,1:2], P)
#Converts data frame to spatial points data frame
SP <- as(SPDF, "SpatialPoints") #Converts SPDF to spatial points
SP1 <- as.ppp(coordinates(SP), W)
SP2 <- as.ppp(SP1)
attr(SP1, "rejects")
attr(SP2, "rejects")
aw <- area.owin(W) #Area, in pixels squared, of leaf window created earlier
#awm <- aw * (meas)^2 * 100 #Area window in millimeters squared
# Trichome_Density_Count-----------------------------------------------------------------------------------------------
TC <- nrow(P) #Counts number of rows in XY data points file,
#this is number of trichomes from ImageJ
TD <- TC/awm #Trichome density, trichomes per mm^2
#SPDF2 <- as.SpatialPoints.ppp(SP2)
#kg <- knn.graph(SPDF2, k = 1)
#Creates the lines connecting each NND pairwise connection
#dfkg <- data.frame(kg) #Converts lines into a data frame
#dfkgl <- dfkg$length
meanlength <- 78
discstest <- discs(SP2, radii = meanlength,
                   separate = TRUE, mask = FALSE, trim = FALSE,
                   delta = NULL, npoly = NULL)
#Function creates discs for each trichome
#Using nearest neighbor lengths as radii
#NEED TO ADD CLIPPING
ratiolist <- c()
for (i in 1:length(discstest)) {
  ow2sp <- owin2SP(discstest[[i]])
  leafsp <- owin2SP(W)
  tic("gIntersection")
  intersect <- rgeos::gIntersection(ow2sp, leafsp)
  Sys.sleep(1)
  toc()
  tic("over")
  res <- as.data.frame(sp::over(SP, intersect, returnList = FALSE))
  Sys.sleep(1)
  toc()
  res[is.na(res)] <- 0
  newowin <- as.owin(intersect)
  circarea <- area.owin(newowin)
  trichactual <- sum(res)
  trichexpect <- (TC / aw) * circarea
  ratio <- trichactual / trichexpect
  ratiolist[[i]] <- ratio
}
If I understand you correctly you want to loop through each point and check how many points fall within a disc of radius R centered in that point. This is done very efficiently in spatstat with the function closepaircounts:
closepaircounts(SP2, r = meanlength)
This simply returns a vector with the number of points contained in the disc of radius r for each point in SP2.
I have just tried this for 100,000 points, where each point on average had almost 3,000 other points in the disc around it, and it took 8 seconds on my laptop. If you have many more points, or in particular if the disc radius is so big that each disc contains many more points, it may become very slow to calculate this.
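For reference, a minimal self-contained sketch with simulated data (the point pattern and radius here are made up for illustration):
library(spatstat)
set.seed(1)
pp <- runifpoint(300)                    # 300 uniform points in the unit square
counts <- closepaircounts(pp, r = 0.05)  # number of points within 0.05 of each point
head(counts)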

Raster calculation with a per-cell condition across layers in R

I have a stacked raster dataset with several layers. I want to calculate, for each cell, the sum over a different selection of layers and finally generate a new layer. Does anyone have a good suggestion using calc, overlay, or some other raster calculation in R?
I can do it with loops, but that takes a long time when I have many layers and also uses a lot of storage. My script is as follows:
library(raster)
make_calc <- function(rr, start, end) {
  rr <- as.array(rr)
  start <- as.array(start)
  end <- as.array(end)
  dms <- dim(rr)
  tmp <- array(NA, dim = dms[1:2])
  for (i in 1:dms[1]) {
    for (j in 1:dms[2]) {
      tmp[i,j] <- sum(rr[i, j, start[i,j,1]:end[i,j,1]], na.rm = TRUE)
    }
  }
  return(tmp)
}
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
start <- raster(res = 10)
start[] <- sample(1:2, ncell(start), replace = TRUE)
end <- raster(res = 10)
end[] <- sample(3:4, ncell(end), replace = TRUE)
result <- make_calc(rr, start, end)
Why are you coercing to arrays? You can easily collapse a raster into a vector, but that does not even seem necessary here. In the future, please try to be clearer about what your expected outcome is.
Based on your code, I really don't know what you are after, so I will take a few guesses: summing specified rasters in the stack, summing a random sample of layer values per cell, and finally summing a random sample of cells.
For a sum on specified rasters in a stack, you can just index what you are after in the stack using a double bracket. In this case, rasters 1 and 3 in the stack would be the only ones summed.
library(raster)
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
( sum_1_3 <- calc(rr[[c(1,3)]], sum) )
If you want, for every cell, a random sample of the values across layers, you can write a function that is passed to calc. Here is an example that draws a random sample of size n from the layer values of each cell and sums them.
rs.sum <- function(x, n=2) {sum( x[sample(1:length(x),n)], na.rm=TRUE)}
rs.sum.raster <- calc(rr, rs.sum)
If you want to apply a function to a limited random selection of cells, you can create a random sample of cell indices and use it as an index. Here we draw a random sample of cells, create an empty raster, and assign the sum of rasters 1 and 2 (in the stack) at the sampled cells. A raster in the stack is indexed with double brackets and raster values with single brackets, so to take raster 1 in the stack limited to the sampled cells you would use rr[[1]][rs].
rs <- sample(1:ncell(rr[[1]]), 300)
r.sum <- rr[[1]]
r.sum[] <- NA
r.sum[rs] <- rr[[1]][rs] + rr[[2]][rs]
plot(r.sum)
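If the goal really is the per-cell start/end sum from the question, a hedged sketch (reusing the rr, start and end objects created there) that avoids the array coercion is:
vals  <- values(rr)     # matrix: one row per cell, one column per layer
idx_s <- values(start)  # first layer to include, per cell
idx_e <- values(end)    # last layer to include, per cell
out <- raster(rr)       # empty raster with the same geometry
out[] <- sapply(seq_len(nrow(vals)), function(k)
  sum(vals[k, idx_s[k]:idx_e[k]], na.rm = TRUE))
This still loops over cells, but works directly on the value matrix and returns a RasterLayer.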

Draw a heatmap with "super big" matrix

I want to draw a heatmap.
I have a 100k x 100k square matrix (50 GB as CSV; values in the upper triangle, the rest filled with 0).
How can I draw a heatmap in R with this huge dataset?
I'm trying this code on a machine with a lot of RAM.
d = read.table("data.csv", sep=",")
d = as.matrix(d + t(d))
heatmap(d)
I tried some alternatives like heatmap.2 (in gplots), but they take too much time and memory.
What I suggest is to heavily down-sample your matrix before plotting the heatmap, e.g. by taking the mean of each submatrix (as suggested by @IaroslavDomin):
# example of big mx 10k x 10k
bigMx <- matrix(rnorm(10000*10000, mean = 0, sd = 100), 10000, 10000)
# here we downsample the big matrix 10k x 10k to 100 x 100
# by averaging each submatrix
downSampledMx <- matrix(NA, 100, 100)
subMxSide <- nrow(bigMx) / nrow(downSampledMx)
for (i in 1:nrow(downSampledMx)) {
  rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1)) + 1
  for (j in 1:ncol(downSampledMx)) {
    colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1)) + 1
    downSampledMx[i,j] <- mean(bigMx[rowIdxs, colIdxs])
  }
}
# NA to disable the dendrograms
heatmap(downSampledMx, Rowv = NA, Colv = NA)
Of course, with your huge matrix it will take a while to compute downSampledMx, but it should be feasible.
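If the explicit double loop is too slow, a hedged sketch of the same block mean without loops (assuming the matrix side is an exact multiple of the block size) is:
blockMean <- function(m, newSize) {
  side <- nrow(m) / newSize
  grp  <- rep(seq_len(newSize), each = side)
  # sum rows within each block, then columns, then divide by the block size
  t(rowsum(t(rowsum(m, grp)), grp)) / side^2
}
downSampledMx <- blockMean(bigMx, 100)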
EDIT:
I think downsampling should preserve recognizable "macro-patterns", e.g. see the following example:
# create a matrix with some recognizable pattern
set.seed(123)
bigMx <- matrix(rnorm(50*50, mean = 0, sd = 100), 50, 50)
diag(bigMx) <- max(bigMx) # set maximum value on the diagonal
# set maximum value on a circle centered on the middle
for (i in 1:nrow(bigMx)) {
  for (j in 1:ncol(bigMx)) {
    if (abs((i - 25)^2 + (j - 25)^2 - 10^2) <= 16)
      bigMx[i,j] <- max(bigMx)
  }
}
# plot the original heatmap
heatmap(bigMx, Rowv = NA, Colv = NA, main = "original")
# function used to down sample
downSample <- function(m, newSize) {
  downSampledMx <- matrix(NA, newSize, newSize)
  subMxSide <- nrow(m) / nrow(downSampledMx)
  for (i in 1:nrow(downSampledMx)) {
    rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1)) + 1
    for (j in 1:ncol(downSampledMx)) {
      colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1)) + 1
      downSampledMx[i,j] <- mean(m[rowIdxs, colIdxs])
    }
  }
  return(downSampledMx)
}
# downsample x 2 and plot heatmap
downSampledMx <- downSample(bigMx, 25)
heatmap(downSampledMx, Rowv = NA, Colv = NA, main = "downsample x 2")
# downsample x 5 and plot heatmap
downSampledMx <- downSample(bigMx, 10)
heatmap(downSampledMx, Rowv = NA, Colv = NA, main = "downsample x 5")
The code above produces the three heatmaps (original, downsample x 2, downsample x 5).
