I want to create polygons inside an apply call, as quickly as possible, from a matrix of coordinates. I have some code and realized this is one of the slowest parts of it. How can I do this efficiently? I tried two different approaches:
Approach 1
library(sp)
library(terra)
t0 <- Sys.time()
poly_list <- apply(matrix(1:10000), 1, function(idx){
# set coordinates
coords <- cbind(rnorm(100), rnorm(100))
# create polygon
Polygons(list(Polygon(coords)), idx)
})
# convert to terra polygons
poly_terra <- vect(SpatialPolygons(poly_list))
# show time passed
print(Sys.time() - t0)
# Time difference of 2.082166 secs
Approach 2
t0 <- Sys.time()
poly_list <- apply(matrix(1:10000), 1, function(idx){
# set coordinates
coords <- cbind(rnorm(100), rnorm(100))
# create polygon
vect(coords, type = "polygon")
})
# convert to terra polygons
poly_terra <- vect(poly_list)
print(Sys.time() - t0)
# Time difference of 16.38044 secs
Why is it faster to create sp polygons and convert them afterwards than directly creating terra polygons? The code with vect(SpatialPolygons(Polygons(list(Polygon(coords)), idx))) seems somewhat complicated. Is there a faster or at least more elegant way?
Edit: currently my fastest option, although it feels illegal:
t0 <- Sys.time()
dummy <- Polygons(list(Polygon(cbind(rep(0,4), rep(0,4)))), "0")
poly_list <- apply(matrix(1:10000), 1, function(idx){
# set coordinates
coords <- cbind(rnorm(100), rnorm(100))
# create polygon
new <- dummy
new@ID <- as.character(idx)
new@Polygons[[1]]@coords <- coords
return(new)
})
# convert to terra polygons
poly_terra <- vect(SpatialPolygons(poly_list))
print(Sys.time() - t0)
# Time difference of 0.7147191 secs
This is faster than your examples:
t0 <- Sys.time()
p <- lapply(1:10000, function(idx){
cbind(id=idx, x=rnorm(100), y=rnorm(100))
})
p <- do.call(rbind, p)
v <- vect(p, "polygons")
print(Sys.time() - t0)
#Time difference of 0.483578 secs
This uses lapply and you state that you want to use apply; but in the context of your example apply does not seem to be a good choice.
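Since the coordinates in your example are random anyway, a further shortcut (a sketch I have not benchmarked) is to skip the per-polygon loop entirely and build the whole id/x/y matrix in one vectorized call:
library(terra)
t0 <- Sys.time()
# 10000 polygons with 100 vertices each, generated in one go
p <- cbind(id = rep(1:10000, each = 100), x = rnorm(1e6), y = rnorm(1e6))
v <- vect(p, "polygons")
print(Sys.time() - t0)
This builds the same matrix as the lapply/do.call(rbind, ...) version above, just without the intermediate list.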
I do not see much performance difference between your two sp approaches. Below I use a streamlined version of the one you say is fastest and benchmark it against my approach:
with_terra <- function() {
p <- lapply(1:10000, function(idx){
cbind(id=idx, x=rnorm(100), y=rnorm(100))
})
p <- do.call(rbind, p)
vect(p, "polygons")
}
with_sp <- function() {
dummy <- Polygons(list(Polygon(cbind(rep(0,4), rep(0,4)))), "0")
poly_list <- apply(matrix(1:10000), 1, function(idx){
dummy@ID <- as.character(idx)
dummy@Polygons[[1]]@coords <- cbind(rnorm(100), rnorm(100))
dummy
})
vect(SpatialPolygons(poly_list))
}
bm <- microbenchmark::microbenchmark(
sp = with_sp(),
terra = with_terra(),
times = 10
)
bm
#Unit: milliseconds
# expr min lq mean median uq max neval
# sp 836.8434 892.8411 930.5261 935.3788 968.2724 1039.2840 10
# terra 261.2191 276.0770 298.3603 282.7462 296.3674 437.0505 10
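Note that the sp route keeps idx as the Polygons ID, while the terra route above only creates geometries. If you also need the id as an attribute, one option (a sketch, not benchmarked here) is to set the attribute table once at the end:
v <- with_terra()
# assumes one attribute row per polygon, in the same order as the ids used above
values(v) <- data.frame(id = 1:10000)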
I'm not really sure if this will be of help, but I had a good time experimenting and fine-tuning, and thought I'd share my preliminary results at least.
Foremost, let me share some input for further reading:
This article was the starting point for some code optimization I worked on some time ago: FasteR! HigheR! StrongeR!
Maybe it's just me, but I prefer for-loops over apply, and this approach does not seem to be slower (cf. here). On the contrary, the median execution time of your first approach was ~0.12 s faster on my machine after I switched to a loop, but maybe you have another reason for using apply here.
If you choose to go for a loop, here is another great guide on how to reduce execution time.
Making use of namespaces (pkg::fun()) actually slows down your code (cf. here), so it is better to attach the packages you are going to use, like you did.
The native pipe does not seem to have any overhead (cf. here), so it might be a great way to un-nest your functions and make them look tidier without penalties.
For timing purposes, I came across {tictoc} some time ago, which from my point of view is a nice implementation of the Sys.time() - t0 pattern; for actual benchmarking, {microbenchmark} is great.
Noam Ross suggests finding better (= faster) packages for your purpose. You already noticed that {sp} operates faster than {terra} with your example. Let me present a third option:
library(sp)
library(terra)
library(sf)
# first node has to be equal to the last node for a polygon to be closed
coords <- cbind(rnorm(99), rnorm(99))
coords <- rbind(coords, coords[1, ])
mbm <- microbenchmark::microbenchmark(
sp = Polygon(coords) |> list() |> Polygons(1),
terra = vect(coords, type = "polygons"),
sf = list(coords) |> st_polygon(),
times = 100
)
ggplot2::autoplot(mbm)
If your target object has to be of class SpatVector, you may consider applying terra::vect() once as a final step. However, what exactly is your goal once you have created your polygon objects? This might affect which package / workflow to use, e.g. if you only need geometries in a specific order, you might drop attributes etc.
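As a minimal sketch of that "convert once at the end" idea with {sf} (assuming random coordinates as in your question, with each ring closed explicitly), you could build all geometries first and call vect() a single time; I have not benchmarked this against the approaches above, so treat it as a starting point rather than a recommendation:
library(sf)
library(terra)
geoms <- lapply(1:10000, function(idx) {
  coords <- cbind(rnorm(100), rnorm(100))
  coords <- rbind(coords, coords[1, ]) # close the ring
  st_polygon(list(coords))
})
# single conversion step at the very end
poly_sf <- st_sf(id = 1:10000, geometry = st_sfc(geoms))
poly_terra <- vect(poly_sf)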
Considering your third approach: it should not feel illegal from my point of view; pre-allocating objects and exchanging some slots seems like a smart move. A condensed version might combine loops, pipes and maybe the {sf} package, although I have failed to implement the latter so far; at least I did not slow down your code:
# your take on this
illegal_approach_a <- function() {
dummy <- Polygons(list(Polygon(cbind(rep(0,4), rep(0,4)))), "0")
poly_list <- apply(matrix(1:10000), 1, function(idx){
# set coordinates
coords <- cbind(rnorm(100), rnorm(100))
# create polygon
new <- dummy
new@ID <- as.character(idx)
new@Polygons[[1]]@coords <- coords
return(new)
})
# convert to terra polygons
poly_terra <- vect(SpatialPolygons(poly_list))
}
# my take on this
illegal_approach_b <- function() {
dummy <- cbind(rep(0, 4), rep(0, 4)) |> Polygon() |> list() |> Polygons("0")
poly_list <- list()
for (i in 1:10000) {
# set coordinates
coords <- cbind(rnorm(100), rnorm(100))
# create polygon
new <- dummy
new@ID <- as.character(i)
new@Polygons[[1]]@coords <- coords
poly_list[[i]] <- new
}
# convert to terra polygons
poly_terra <- SpatialPolygons(poly_list) |> vect()
}
mbm <- microbenchmark::microbenchmark(
your_take = illegal_approach_a(),
my_take = illegal_approach_b(),
times = 100
)
ggplot2::autoplot(mbm)
I am a relatively new R programmer and have written a script that takes some statistical results and ultimately compares them to a vector of results in which the target variable has been randomized. The result vector contains the statistical results of n simulations. As the number of simulations increases (I would like to run at least 10,000), the run time becomes longer than I would like. I have tried to improve performance in the ways I know how, but would love the help of others in optimizing it. The relevant part of the code is below.
#CREATE DATA
require(plyr)
Simulations <- 10001
Variation <- c("Control", "A", "B","C")
Trials <- c(727,724,723,720)
NonResponse <- c(692,669,679,682)
Response <- c(35,55,44,38)
ConfLevel <- .95
#PERFORM INITIAL CALCS
NonResponse <- Trials-Response
Data <-data.frame(Variation, NonResponse, Response, Trials)
total <- ddply(Data,.(Variation),function(x){data.frame(value = rep(c(0,1),times = c(x$NonResponse,x$Response)))})
total <- total[sample(1:nrow(total)), ]
colnames(total) <- c("Variation","Response")
#CREATE FUNCTION TO PERFORM SIMULATIONS
targetshuffle <- function(x)
{
shuffle_target <- x[,"Response"]
shuffle_target <- data.frame(sample(shuffle_target))
revised <- cbind(x[,"Variation"], shuffle_target)
colnames(revised) <- c("Variation","Yes")
yes_variation <- data.frame(table(revised$Yes,revised$Variation))
colnames(yes_variation) <- c("Yes","Variation","Shuffled_Response")
Shuffled_Data <- subset(yes_variation, yes_variation$Yes==1)
Shuffled_Data <- Shuffled_Data[match(Variation, Shuffled_Data$Variation),]
yes_variation <- cbind(Data,Shuffled_Data)
VectorPTest_All <- yes_variation[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
Control_Only <- yes_variation[yes_variation$Variation=="Control",]
VectorPTest_Chall <- subset(yes_variation,!(Variation=="Control"))
VectorPTest_Chall <- VectorPTest_Chall[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
ControlResponse <- Control_Only$Response
ControlResponseRevised <- Control_Only$Shuffled_Response
ControlTotal <- Control_Only$Trials
VariationCount <- length(VectorPTest_Chall$Variation)
VP <- data.frame(c(VectorPTest_Chall,rep(ControlResponse),rep(ControlResponseRevised),rep(ControlTotal)))
names(VP) <- c("Variation","NonResponse","Response", "Trials", "ResponseShuffled", "ControlResponse",
"ControlResponseShuffled","ControlTotal")
VP1 <<- data.frame(VP[,c(5,7,4,8)])
VP2 <<- data.frame(VP[,c(3,6,4,8)])
ptest <- apply(VP1, 1, function(column) prop.test(x=c(column[1], column[2]),
n=c(column[3], column[4]), alternative="two.sided",
conf.level=ConfLevel, correct=FALSE)$p.value)
min_p_value <- min(ptest)
return(min_p_value)
}
#CALL FUNCTION
sim_result <- do.call(rbind, rlply(Simulations, targetshuffle(total)))
Offhand, one thing to look at is creating all the data frames. Each time you do that, you're copying all the data in the constituent objects. If the dimensions are predictable, you might consider creating empty matrices at the beginning of the function and populating them as you go.
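As a rough illustration of the pre-allocation idea (placeholder logic only, not your actual statistics):
n_sims <- 10001
# allocate the result vector once instead of growing it inside the loop
min_p <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  # ... shuffle the responses, tabulate per variation and run prop.test() here,
  # working on matrices/vectors instead of building new data frames each time ...
  min_p[i] <- runif(1) # placeholder for the real minimum p-value
}
The same applies inside targetshuffle(): if the per-variation counts always have the same dimensions, a pre-allocated matrix can replace the repeated data.frame()/cbind() calls.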
I have a raster map with many patches (clumps of contiguous cells with the same value). What I need to do is to obtain the coordinates of the center (or close to the center) of each patch.
I am very inexperienced with the raster package, but it seems I can get coordinates only if I know the position of the cells in the map. Is there any way to get coordinates given a value of the cells instead? Thank you.
If by patch you mean clumps, the raster package allows you to find, and isolate, clumps. Taking the clump() example from the raster package and extending it:
library(raster)
library(igraph)
detach("package:coin", unload=TRUE)
r <- raster(ncols=12, nrows=12)
set.seed(0)
r[] <- round(runif(ncell(r))*0.7 )
rc <- clump(r)
clump_id <- getValues(rc)
xy <- xyFromCell(rc,1:ncell(rc))
df <- data.frame(xy, clump_id, is_clump = rc[] %in% freq(rc, useNA = 'no')[,1])
df[df$is_clump == T, ]
plot(r)
plot(rc)
text(df[df$is_clump == T, 1:2], labels = df[df$is_clump == T, 3])
It may not be as interesting as you would expect.
You can do it all over with directions = 4
rc <- clump(r, directions = 4)
clump_id <- getValues(rc)
xy <- xyFromCell(rc,1:ncell(rc))
df <- data.frame(xy, clump_id, is_clump = rc[] %in% freq(rc, useNA = 'no')[,1])
df[df$is_clump == T, ]
to get the updated clump assignment, and maybe the clump 'centroids':
library(plyr)
dfm <- ddply(df[df$is_clump == T, ], .(clump_id), summarise, xm = mean(x), ym = mean(y))
plot(rc)
text(dfm[, 2:3], labels = dfm$clump_id)
Notes:
There will be an error if you try to use clump() without first detaching the modeltools library; modeltools is loaded by coin and maybe other statistical libraries.
You could take the mean of the coordinates of each patch:
library(raster)
# some dummy data
m <- matrix(c(
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,
0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,
0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,
0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,
0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,
0,0,2,3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow=20, byrow=T)
# create a raster
r <- raster(m)
# convert raster to points
p <- data.frame(rasterToPoints(r))
# filter out background cells
p <- p[p$layer > 0,]
# for each patch calc mean coordinates
sapply(split(p[, c("x", "y")], p$layer), colMeans)
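The result of that last line should be a small matrix with rows x and y and one column per patch value; if you prefer one row per patch, you can transpose it:
# one row per patch, columns x and y
centers <- as.data.frame(t(sapply(split(p[, c("x", "y")], p$layer), colMeans)))
centers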