I need to rapidly determine whether a spatial polygon and a spatial line intersect. I am currently converting the polygon to a SpatialLines object and using gIntersection(). Can anyone suggest a potentially quicker method, perhaps using rasters instead of spatial lines? I need to do this many thousands of times.
# .shp file to SpatialLines
library(maptools) # readShapeSpatial()
library(sp)
library(rgeos)    # gIntersection(), gIntersects()
polygon1 <- readShapeSpatial("C:.../SALandmass.shp")
polygon1filled <- SpatialPolygons(list(Polygons(list(polygon1@polygons[[1]]@Polygons[[1]]), ID=1)))
SL <- as(polygon1filled, "SpatialLines")
# Test if the line between two coordinates crosses the shape
Pt1 = list(x = c(CurrentLong, MapCoordsm$x[i]), y = c(CurrentLat, MapCoordsm$y[i]))
SpatialLine1 = SpatialLines(list(Lines(Line(cbind(Pt1$x,Pt1$y)), "L1")))
cross <- length(gIntersection(SpatialLine1, SL))
Whereas gIntersection() returns a geometry with the intersection, gIntersects() returns a logical indicating whether the two geometries intersect.
Thanks for the input Edzer, and others.
I ran some tests on your suggestions, and it seems gIntersects() makes a huge difference. Converting the polygon to spatial lines first versus just using the polygon makes essentially no difference.
Here are the test results:
# Original approach
system.time({
Pt1 = list(x = c(long1, long2), y = c(lat1, lat2))
SpatialLine1 = SpatialLines(list(Lines(Line(cbind(Pt1$x,Pt1$y)), "L1")))
cross <- length(gIntersection(SpatialLine1, SL))
})
## user system elapsed
## 0.53 0.00 0.53
# Edzer suggestion: using gIntersects
system.time({
Pt1 = list(x = c(long1, long2), y = c(lat1, lat2))
SpatialLine1 = SpatialLines(list(Lines(Line(cbind(Pt1$x,Pt1$y)), "L1")))
cross <- (gIntersects(SpatialLine1, SL))
})
# user system elapsed
# 0.06 0.00 0.06
# Edzer suggestion 2: using a polygon rather than spatial lines
system.time({
Pt1 = list(x = c(long1, long2), y = c(lat1, lat2))
SpatialLine1 = SpatialLines(list(Lines(Line(cbind(Pt1$x,Pt1$y)), "L1")))
cross <- length(gIntersection(SpatialLine1, polygon1filled))
})
# user system elapsed
# 0.43 0.05 0.48
# Edzer suggestion 1&2: using a polygon rather than spatial lines and gIntersects
system.time({
Pt1 = list(x = c(long1, long2), y = c(lat1, lat2))
SpatialLine1 = SpatialLines(list(Lines(Line(cbind(Pt1$x,Pt1$y)), "L1")))
cross <- (gIntersects(SpatialLine1, polygon1filled))
})
# user system elapsed
# 0.06 0.00 0.07
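Since the test has to be repeated many thousands of times, it may also be worth noting that gIntersects() is vectorized over geometries via its byid argument, so all candidate segments can be built into one SpatialLines object and tested in a single call. A rough, untested sketch, reusing MapCoordsm, CurrentLong and CurrentLat from the original code:
# build all candidate segments at once and test them in one vectorized call
all_lines <- SpatialLines(lapply(seq_len(nrow(MapCoordsm)), function(i) {
  Lines(Line(cbind(c(CurrentLong, MapCoordsm$x[i]),
                   c(CurrentLat,  MapCoordsm$y[i]))), ID = as.character(i))
}))
crosses <- gIntersects(all_lines, polygon1filled, byid = TRUE) # logical, one entry per segment/polygon pair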
I need to do calculations on a lot of big rasters (28,000 cells, 181 layers). I tried my code on a small subset (24 cells, 181 layers), and I took help from this forum to optimize it as much as I could.
I used bricks in the raster package because I read that bricks are loaded into memory and are faster to process, but some suggest that terra is better and easier than raster.
I ran my code with both packages, and raster is quicker in my case (9 min compared with 16 min). That time is for the small dataset; when I run it on the original data the computer takes forever. When I check the CPU and RAM use during the process, it is around 16% and about 1 GB respectively. Even if my code is inefficient, why isn't R using the available RAM and CPU?
What I'm trying to do is implement a model where I have to interpolate Landsat NDVI to a daily timestep using a spline, and then calculate different variables using mathematical equations. Some are simple and straightforward, but some are quite complex.
What I want is an efficient way of calculating the different variables.
I would really appreciate it if someone could explain:
Even if my code is inefficient, I believe the computer should use the maximum available resources. Is there a way to make that happen?
Is the raster package better in my case?
I see some are recommending parallel processing. As I'm not a programmer, I'll have to read about it before implementing it, though I do know what it is.
Apologies for not presenting the queries as they should be the first time. I hope this is in better shape now. Thanks.
Here is reproducible terra code (special thanks to Robert Hijmans):
library(terra)
b <- r1 <- r2 <- rast(ncols=5, nrows=5, nlyrs=5, vals=NA)
set.seed(0)
values(b) <- runif(ncell(b) * nlyr(b))
b[c(1,2,3,22,23,24,25)] <- NA
p <- 0.15
p1 <- p/3
p2 <- p-(p/3)
fc <- 0.3
weather <- c(0.1, 0, 0, 0, 0.3)
r2[[1]] <- ifel(is.na(b[[1]]), NA, 0.3)
for (i in 1:nlyr(b)) {
  varr1 <- b[[i]] * (((r2[[i]] - p1) / p2)^2)
  r1[[i]] <- ifel(r2[[i]] > p, b[[i]], varr1)
  for (k in 2:nlyr(b)) {
    r2[[k]] <- min(r2[[k-1]] + (weather[k-1] - r1[[k-1]]) / 100, fc)
  }
}
Here is reproducible raster code:
library(raster)
b <- brick(ncols=5, nrows=5, nl=5)
inBrick <- setValues(b, runif(ncell(b) * nlayers(b)))
inBrick[c(1,2,3,22,23,24,25)] <- NA
outBrick1 <- inBrick
outBrick1[] <- NA
outBrick2 <- outBrick1
ini <- 0.3
p <- 0.15
p1 <- p/3
p2 <- p-(p/3)
fc <- 0.3
var1 <- which(!is.na(inBrick[[1]][]))
outBrick2[[1]][var1] <- ini
### now outBrick2 has initial values in 1st layer
weather <- c(0.1, 0, 0, 0, 0.3)
var3 <- 1:ncell(inBrick)
### outBrick1 Calculations
for (i in 1:nlayers(inBrick)) {
  varr1 <- inBrick[[i]][] * (((outBrick2[[i]][] - p1) / p2)^2)
  for (j in 1:ncell(inBrick)) {
    if (!is.na(outBrick2[[i]][j])) {
      if (outBrick2[[i]][j] > p) {
        outBrick1[[i]][j] <- inBrick[[i]][j]
      } else {
        outBrick1[[i]][j] <- varr1[j]
      }
    }
  }
  ### outBrick2 calculations
  for (k in 2:nlayers(inBrick)) {
    var2 <- outBrick2[[k-1]][] + (weather[k-1] - outBrick1[[k-1]][]) / 100
    for (l in 1:ncell(inBrick)) {
      var3[l] <- min(fc, var2[l])
    }
    outBrick2[[k]][] <- var3
  }
}
"terra" is generally faster, sometimes much faster. nukubiho points out that with arithmetic computations "raster" may be faster than "terra". However, the difference is generally small, and may only be true when the cell values are in memory.
However, this may not hold for large datasets (where these differences may actually matter). The cell values for these datasets are typically in a file.
In memory:
library("raster")
library("terra")
n = 12000
x_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
y_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
x_raster = raster(x_terra)
y_raster = raster(y_terra)
r <- c(x_terra, y_terra)
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
# user system elapsed
# 1.83 0.36 2.19
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
# user system elapsed
# 2.66 2.25 4.91
system.time(app(r, \(x) (x[,1]-x[,2]) / (x[,1] + x[,2])))
# user system elapsed
# 4.06 2.17 6.25
"raster" is faster. But when the values are in files, "terra" is faster.
x_terra = writeRaster(x_terra, "test1.tif", overwrite=T)
y_terra = writeRaster(y_terra, "test2.tif", overwrite=T)
x_raster = raster(x_terra)
y_raster = raster(y_terra)
r <- c(x_terra, y_terra)
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
# user system elapsed
# 17.25 5.39 22.66
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
# user system elapsed
# 13.63 4.92 18.61
system.time(app(r, \(x) (x[,1]-x[,2]) / (x[,1] + x[,2])))
# user system elapsed
# 9.78 3.46 13.24
Generally, terra is faster than raster, but one exception is in-memory arithmetic calculations. Probably the reason is that raster uses heavily optimized R code. See this simple example:
library("raster")
library("terra")
n = 12000
x_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
y_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
x_raster = raster(x_terra)
y_raster = raster(y_terra)
## terra
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
#> user system elapsed
#> 2.63 2.22 4.86
## raster
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
#> user system elapsed
#> 1.78 0.26 2.05
If you want to use more CPU and RAM, you could split the data into blocks and process them in parallel (a minimal sketch follows the links below). See these questions:
terra package returns error when try to run parallel operations
Process sets of rasters in parallel using lapp function from terra package
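For example, the app() call from the benchmark above accepts a cores argument that processes the raster in blocks on several worker processes. A minimal sketch, assuming x_terra and y_terra from the example above (with cores > 1 the function should only use base R operations):
r <- c(x_terra, y_terra)
nd <- app(r, \(x) (x[, 1] - x[, 2]) / (x[, 1] + x[, 2]), cores = 2)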
I'd like to generate 100 random points while enforcing a minimum distance of 1000 meters between them, by drawing a buffer of that size around each point with st_buffer() and eliminating any offending points. But in my example:
library(sf)
# Data set creation
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m")
st_bbox(df.laea)
#
# Random simulation of 100 point inside df.laea extent
sim_study_area <- st_sample(st_as_sfc(st_bbox(df.laea)), 100) %>% # random points, as a list ...
st_sf()
border_area <- st_as_sfc(st_bbox(df.laea)) %>% # the bounding box, as a polygon ...
st_sf()
# I'd like to impose a minimum distance of 1000 meters between points, and for this:
i <- 1 # iterator start
buffer_size <- 1000 # minimal distance to be enforced (in meters)
repeat({
  # create a buffer around the i-th point
  buffer <- st_buffer(sim_study_area[i,], buffer_size)
  offending <- sim_study_area %>% # which master points...
    st_intersects(buffer, sparse = F) # ... fall inside the buffer, as a vector
  # the i-th point itself is not really offending
  offending[i] <- FALSE
  # if there are any offending points left - re-assign the master points
  sim_study_area <- sim_study_area[!offending,]
  if (i >= nrow(sim_study_area)) {
    # the end was reached; no more points to process
    break
  } else {
    # rinse & repeat
    i <- i + 1
  }
})
# Visualization of the points created with the offending condition:
simulation_area <- ggplot() +
geom_sf(data = border_area, col = 'gray40', fill = NA, lwd = 1) +
geom_sf(data = sim_study_area, pch = 3, col = 'red', alpha = 0.67) +
theme_bw()
plot(simulation_area)
The result is not OK because I don't end up with 100 points, and I don't know how to fix it.
Any ideas?
Thanks in advance,
Alexandre
I think that the easiest solution is to adopt one of the sampling functions defined in the R package spatstat. For example:
# packages
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
# create data
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(
df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m"
)
Now we sample with a Simple Sequential Inhibition Process. Check ?spatstat.core::rSSI for more details.
sampled_points <- st_sample(
x = st_as_sfc(st_bbox(df.laea)),
type = "SSI",
r = 1000, # threshold distance (in metres)
n = 100 # number of points
)
# Check result
par(mar = rep(0, 4))
plot(st_as_sfc(st_bbox(df.laea)), reset = FALSE)
plot(sampled_points, add = TRUE, pch = 16)
# Estimate all distances
all_distances <- st_distance(sampled_points)
all_distances[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00 57735.67 183205.74 189381.50 81079.79
#> [2,] 57735.67 0.00 153892.93 143755.73 61475.85
#> [3,] 183205.74 153892.93 0.00 62696.68 213379.39
#> [4,] 189381.50 143755.73 62696.68 0.00 194237.12
#> [5,] 81079.79 61475.85 213379.39 194237.12 0.00
# Check they are all greater than 1000
sum(all_distances < 1000)
#> [1] 100 # since the diagonal is full of 100 zeros
Created on 2021-08-12 by the reprex package (v2.0.0)
Check here (in particular the answer from Prof. Baddeley), the references therein, and the help page of st_sample for more details.
Imagine a regular 0.5° grid across the Earth's surface. A 3x3 subset of this grid is shown below. As a stylized example of what I'm working with, let's say I have three polygons—yellow, orange, and blue—that for the sake of simplicity all are 1 unit in area. These polygons have attributes Population and Value, which you can see in the legend:
I want to turn these polygons into a 0.5° raster (with global extent) whose values are based on the weighted-mean Value of the polygons. The tricky part is that I want to weight the polygons' values based on not their Population, but rather on their included population.
I know—theoretically—what I want to do, and below have done it for the center gridcell.
Multiply Population by Frac. included (the fraction of the polygon's area that falls inside the gridcell) to get Pop. included. (This assumes population is distributed evenly throughout the polygon, which is acceptable.)
Divide each polygon's Pop. included by the sum of all polygons' Pop. included (32) to get Weight.
Multiply each polygon's Value by Weight to get Result.
Sum all polygons' Result to get the value for the center gridcell (0.31).
         Population   Value   Frac. included   Pop. included   Weight   Result
Yellow           24     0.8             0.25               6   0.1875     0.15
Orange           16     0.4             0.5                8   0.25       0.10
Blue             18     0.1             1                 18   0.5625     0.06
Total                                                     32              0.31
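As a quick sanity check, the arithmetic for the centre gridcell in the table above can be reproduced with a few lines of base R:
# check of the centre-gridcell calculation from the table
pop  <- c(24, 16, 18)      # Yellow, Orange, Blue
val  <- c(0.8, 0.4, 0.1)
frac <- c(0.25, 0.5, 1)
pop_inc <- pop * frac                # 6, 8, 18 (sums to 32)
weight  <- pop_inc / sum(pop_inc)    # 0.1875, 0.25, 0.5625
sum(val * weight)                    # 0.30625, i.e. ~0.31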
I have an idea of how to accomplish this in R, as described below. Where possible, I've filled in code that I think will do what I want. My questions: How do I do steps 2 and 3? Or is there a simpler way to do this? If you want to play around with this, I have uploaded old_polygons as a .rds file here.
library("sf")
library("raster")
Calculate the area of each polygon: old_polygons$area <- as.numeric(st_area(old_polygons))
Generate the global 0.5° grid as some kind of Spatial object.
Split the polygons by the grid, generating new_polygons.
Calculate area of the new polygons: new_polygons$new_area <- as.numeric(st_area(new_polygons))
Calculate the fraction included for each new polygon: new_polygons$frac_included <- new_polygons$new_area / new_polygons$area
Calculate the "included population" in the new polygons: new_polygons$pop_included <- new_polygons$Population * new_polygons$frac_included
Calculate a new attribute for each polygon that is just their Value times their included population: new_polygons$tmp <- new_polygons$Value * new_polygons$pop_included
Set up an empty raster for the next steps: empty_raster <- raster(nrows=360, ncols=720, xmn=-180, xmx=180, ymn=-90, ymx=90)
Rasterize the polygons by summing this new attribute together within each gridcell. tmp_raster <- rasterize(new_polygons, empty_raster, "tmp", fun = "sum")
Create another raster that is just the total population in each gridcell: pop_raster <- rasterize(new_polygons, empty_raster, "pop_included", fun = "sum")
Divide the first raster by the second to get what I want:
output_raster <- empty_raster
values(output_raster) <- getValues(tmp_raster) / getValues(pop_raster)
Any help would be much appreciated!
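As a side note, steps 2 and 3 could in principle be done with sf. The following is only a rough, untested sketch: it assumes old_polygons is an sf object in lon/lat (EPSG:4326), and cell_id is a hypothetical column name added here just to keep track of gridcells.
library(sf)
# step 2: global 0.5-degree grid as an sf polygon layer
world_bbox <- st_bbox(c(xmin = -180, ymin = -90, xmax = 180, ymax = 90), crs = st_crs(4326))
grid <- st_make_grid(st_as_sfc(world_bbox), cellsize = 0.5)
grid_sf <- st_sf(cell_id = seq_along(grid), geometry = grid)
# step 3: split the polygons by the grid
new_polygons <- st_intersection(old_polygons, grid_sf)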
Example data:
library(terra)
f <- system.file("ex/lux.shp", package="terra")
v <- vect(f)
values(v) <- data.frame(population=1:12, value=round(c(2:13)/14, 2))
r <- rast(ext(v)+.05, ncols=4, nrows=6, names="cell")
Illustrate the data
p <- as.polygons(r)
plot(p, lwd=2, col="gray", border="light gray")
lines(v, col=rainbow(12), lwd=2)
txt <- paste0(v$value, " (", v$population, ")")
text(v, txt, cex=.8, halo=TRUE)
Solution:
# area of the polygons
v$area1 <- expanse(v)
# intersect with raster cell boundaries
values(r) <- 1:ncell(r)
p <- as.polygons(r)
pv <- intersect(p, v)
# area of the polygon parts
pv$area2 <- expanse(pv)
pv$frac <- pv$area2 / pv$area1
Now we just use the data.frame with the attributes of the polygons to compute the polygon-cover-weighted-population-weighted values.
z <- values(pv)
a <- aggregate(z[, "frac", drop=FALSE], z[,"cell",drop=FALSE], sum)
names(a)[2] <- 'fsum'
z <- merge(z, a)
z$weight <- z$population * z$frac / z$fsum
z$wvalue <- z$value * z$weight
b <- aggregate(z[, c("wvalue", "weight")], z[, "cell", drop=FALSE], sum)
b$bingo <- b$wvalue / b$weight
Assign values back to raster cells
x <- rast(r)
x[b$cell] <- b$bingo
Inspect results
plot(x)
lines(v)
text(x, digits=2, halo=TRUE, cex=.9)
text(v, "value", cex=.8, col="red", halo=TRUE)
This may not scale very well to large data sets, but you could perhaps do it in chunks.
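One rough way to chunk it (an untested sketch that reuses the objects defined above) is to intersect the cell polygons with blocks of features from v and row-bind the attribute tables before the aggregation step:
# hypothetical chunking sketch: intersect the grid with blocks of polygons
idx_blocks <- split(1:nrow(v), ceiling(seq_len(nrow(v)) / 1000))
z <- do.call(rbind, lapply(idx_blocks, function(idx) {
  pv_i <- intersect(p, v[idx, ])
  pv_i$area2 <- expanse(pv_i)
  pv_i$frac  <- pv_i$area2 / pv_i$area1
  values(pv_i)
}))
# then continue with the aggregate()/merge() steps shown above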
This is fast and scalable:
library(data.table)
library(terra)
# make the 3 polygons with radius = 5km
center_points <- data.frame(lon = c(0.5, 0.65, 1),
lat = c(0.75, 0.65, 1),
Population = c(16, 18, 24),
Value = c(0.4, 0.1, 0.8))
polygon <- vect(center_points, crs = "EPSG:4326")
polygon <- buffer(polygon, 5000)
# make the raster
my_raster <- rast(nrow = 3, ncol = 3, xmin = 0, xmax = 1.5, ymin = 0, ymax = 1.5, crs = "EPSG:4326")
my_raster[] <- 0 # set the value to 0 for now
# find the fractions of cells in each polygon
# "cells" gives you the cell ID and "weights" (or "exact") gives you the cell fraction in the polygon
# using "exact" instead of "weights" is more accurate
my_Table <- extract(my_raster, polygon, cells = TRUE, weights = TRUE)
setDT(my_Table) # convert to datatable
# merge the polygon attributes to "my_Table"
poly_Table <- setDT(as.data.frame(polygon))
poly_Table[, ID := 1:nrow(poly_Table)] # add the IDs which are the row numbers
merged_Table <- merge(my_Table, poly_Table, by = "ID")
# find Frac_included
merged_Table[, Frac_included := weight / sum(weight), by = ID]
# find Pop_included
merged_Table[, Pop_included := Frac_included * Population]
# find Weight, to avoid confusion with "weight" produced above, I call this "my_Weight"
merged_Table[, my_Weight := Pop_included / sum(Pop_included), by = cell]
# final results
Result <- merged_Table[, .(Result = sum(Value * my_Weight)), by = cell]
# add the values to the raster
my_raster[Result$cell] <- Result$Result
plot(my_raster)
I am a beginner in R and would like to get some help from experts.
I want to create a function to calculate risk under 3 conditions:
1st: control: Xhh=0, Xmi=0
2nd: hh: Xhh=1, Xmi=0
3rd: hh+mi: Xhh=1, Xmi=1
and compare between 2 groups:
group 1: Xenv=50
group 2: Xenv=90
my parameters:
thi    lambda1  lambda2  lambda3  Beta  Z2    Z1    Z4    Z3    Z6    Z5    theta
1.38   0.34     0.25     0.49     0.5   0.58  0.55  0.59  0.56  0.44  0.61  0.88
I want to plug in all these parameters into this equation
http://i.stack.imgur.com/DYR81.png
and calculate the value at different time points, Ti = 1 to 10. For Ti = 0, set the value to 0.
Then plot a graph of the values at the different time points, with 3 curves for the three conditions, compared between the two groups, so that the graph ends up with 6 curves.
Can anyone please offer some help?
This is probably a little more interpretation of your needs than you will tend to get here, but I was looking for a reason to procrastinate at other tasks this morning.
I think this meets your needs as stated:
#define a function called myfunction
myfunction <- function(lambda1 = 0.34
,lambda2 = 0.25
,lambda3 = 0.49
,Beta = 0.5
,Z2 = 0.58
,Z1 = 0.55
,Z4 = 0.59
,Z3 = 0.56
,Z6 = 0.44
,Z5 = 0.61
,theta = 0.88
,Xenv
,Xhi
,Xmi
,Tmin
,Tmax){
#this should make this function somewhat generalizable to values of T
#create an empty vector to hold values of our function as defined
f <- rep(NA, length(Tmin:Tmax))
#loop through values of T in your function
#check parentheses here - I make no promises
#I'm also unclear what your value of thi is. I may be missing something,
#but I don't see it in the function you have written
i <- Tmin:Tmax
f <- -log(exp((-(lambda1*i)^theta)*exp(log10(1-Beta*Xhi)+log10((Xenv/100)*(Z2-Z1)+Z2))-
((lambda2*i)^theta)*exp(log10(1-Beta*Xmi))*log10((Xenv/100)*(Z4-Z3)+Z4)-
((lambda3*i)^theta)*exp(log10((Xenv/100)*(Z6-Z5)+Z6)))*(1-theta)+theta)
#set f=0 at T=0 (I think this is what you want)
if(Tmin==0) f[1] <- 0
return(f)
}
#you didn't specify how to plot, but this seems to lend itself to a ggplot faceted viz.
require(ggplot2)
require(reshape2)
#calculate for group 1
datg1 <- data.frame(t = 0:10
,group = 1
,condition1 = myfunction(Xenv=50, Xhi=0, Xmi=0, Tmin=0, Tmax=10)
,condition2 = myfunction(Xenv=50, Xhi=1, Xmi=0, Tmin=0, Tmax=10)
,condition3 = myfunction(Xenv=50, Xhi=1, Xmi=1, Tmin=0, Tmax=10)
)
#calculate for group 2
datg2 <- data.frame(t = 0:10
,group = 2
,condition1 = myfunction(Xenv=90, Xhi=0, Xmi=0, Tmin=0, Tmax=10)
,condition2 = myfunction(Xenv=90, Xhi=1, Xmi=0, Tmin=0, Tmax=10)
,condition3 = myfunction(Xenv=90, Xhi=1, Xmi=1, Tmin=0, Tmax=10)
)
#bind values together
dat <- rbind(datg1, datg2)
#melt your data into long format
datm <- melt(dat, id.vars = c("t", "group"))
#plot and facet
ggplot(datm, aes(x=t, y=value, colour=variable)) +
geom_line() +
facet_grid(.~group)
I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and assign each test point to the correct cluster.
I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?
You could use the flexclust package, which has an implemented predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
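For instance, a fit from stats::kmeans on the training rows used above can be converted and then used for prediction (a small sketch based on the objects defined earlier):
# convert a stats::kmeans result to a kcca object and predict on the test rows
cl <- kmeans(dat[dat[["train"]] == TRUE, 1:2], centers = 4)
cl_kcca <- as.kcca(cl, data = dat[dat[["train"]] == TRUE, 1:2])
pred_test2 <- predict(cl_kcca, newdata = dat[dat[["train"]] == FALSE, 1:2])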
Something I noticed about both the approach in the question and the flexclust approach is that they are rather slow (benchmarked here on a training set and a testing set of 1 million observations with 2 features each).
Fitting the original model is reasonably fast:
set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
# user system elapsed
# 1.204 0.077 1.295
The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:
system.time(pred.test <- apply(df2, 1, closest.cluster))
# user system elapsed
# 42.064 0.251 42.586
Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster)
# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
# user system elapsed
# 87.562 1.216 89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
# user system elapsed
# 0.182 0.065 0.250
# Approach #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
# user system elapsed
# 125.193 7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
# user system elapsed
# 0.198 0.084 0.302
It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:
library(FNN)
system.time(pred.knn <- get.knnx(km$centers, df2, 1)$nn.index[,1])
# user system elapsed
# 0.315 0.013 0.345
all(pred.test == pred.knn)
# [1] TRUE
You can use the ClusterR::KMeans_rcpp() function, which uses RcppArmadillo. It allows for multiple initializations (which can be parallelized if OpenMP is available). Besides the optimal_init, quantile_init, random and kmeans++ initializations, one can specify the centroids using the CENTROIDS parameter. The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters.
library(scorecard)
library(ClusterR)
library(dplyr)
library(ggplot2)
## Generate data
set.seed(2019)
x = c(rnorm(200000, 0,1), rnorm(150000, 5,1), rnorm(150000,-5,1))
y = c(rnorm(200000,-1,1), rnorm(150000, 6,1), rnorm(150000, 6,1))
df <- split_df(data.frame(x,y), ratio = 0.5, seed = 123)
system.time(
kmrcpp <- KMeans_rcpp(df$train, clusters = 3, num_init = 4, max_iters = 100, initializer = 'kmeans++'))
# user system elapsed
# 0.64 0.05 0.82
system.time(pr <- predict_KMeans(df$test, kmrcpp$centroids))
# user system elapsed
# 0.01 0.00 0.02
p1 <- df$train %>% mutate(cluster = as.factor(kmrcpp$clusters)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("train data")
p2 <- df$test %>% mutate(cluster = as.factor(pr)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("test data")
gridExtra::grid.arrange(p1,p2,ncol = 2)
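As mentioned above, KMeans_rcpp() also accepts a CENTROIDS matrix; a small hedged sketch re-using the fitted centroids as fixed starting values:
# re-fit using the previously estimated centroids as starting values (sketch)
kmrcpp2 <- KMeans_rcpp(df$train, clusters = 3, max_iters = 100, CENTROIDS = kmrcpp$centroids)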