I am on the lookout for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count function as I don't need any of the other output that is produced (as I want to use it in an sapply call, requiring 1 million iterations in which this function would be called), e.g.
x = runif(100000000, 2.5, 2.6)
bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
Any thoughts?
A first attempt using table and cut:
table(cut(x, breaks=seq(0,3,length.out=100)))
It avoids the extra output, but takes about 34 seconds on my computer:
system.time(table(cut(x, breaks=seq(0,3,length.out=100))))
user system elapsed
34.148 0.532 34.696
compared to 3.5 seconds for hist:
system.time(hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count)
user system elapsed
3.448 0.156 3.605
Using tabulate and .bincode runs a little bit faster than hist:
tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(.bincode(x, breaks=seq(0,3,length.out=100))), nbins=100)
user system elapsed
3.084 0.024 3.107
Using tablulate and findInterval provides a significant performance boost relative to table and cut and has an OK improvement relative to hist:
tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(findInterval(x, vec=seq(0,3,length.out=100))), nbins=100)
user system elapsed
2.044 0.012 2.055
Seems your best bet is to just cut out all the overhead of hist.default.
nB1 <- 99
delt <- 3/nB1
fuzz <- 1e-7 * c(-delt, rep.int(delt, nB1))
breaks <- seq(0, 3, by = delt) + fuzz
.Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)
I pared down to this by running debugonce(hist.default) to get a feel for exactly how hist works (and testing with a smaller vector -- n = 100 instead of 1000000).
Comparing:
x = runif(100, 2.5, 2.6)
y1 <- .Call(graphics:::C_BinCount, x, breaks + fuzz, TRUE, TRUE)
y2 <- hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
identical(y1, y2)
# [1] TRUE
Related
I need to do calculations on a lot of big rasters(28000 cells, 181 layers). I tried my code on a small subset (24cells, 181 layers). I took help from this forum to optimize as much as I could.
Now I used bricks in raster package because I read that bricks are loaded into memory and are faster to process. But then some suggest that terra is better and easier than raster.
I used both packages to run my code and I see that raster is quicker in my case (9min compared with 16min). Now this time is for a small dataset. When I run it on original data the computer takes forever. When I check CPU and RAM use of computer during the process its around 16% and about 1GB respectively. Even if my code is inefficient, why isn't R using available RAM and CPU.
What I'm trying to do is implementing a model where I have to interpolate Landsat NDVI to daily timestep using spline. Then calculate different variables using mathematical equations. Some are really simple and straightforward, but some are really complex.
What I want is an efficient way of calculating different variables.
Would really appreciate if someone can explain:
Even if my code is inefficient what I believe is the computer should use maximum available resources. And is there a way to do that?
Is raster package better in my case?
I see some are recommending parallel processing. As I'm not a programmer so I'll have to read about it before implementing. I do know what it is though.
Apologies for not presenting the queries as they should be 1st time. Hope this is in better shape now. Thanks
Here is reproducible terra code (special thanks to Robert Hijmans):
library(terra)
b <- r1 <- r2 <- rast(ncols=5, nrows=5, nl=5, vals=NA)
set.seed(0)
values(b) <- runif(size(b))
b[c(1,2,3,22,23,24,25)] <- NA
p <- 0.15
p1 <- p/3
p2 <- p-(p/3)
fc <- 0.3
weather <- c(0.1, 0, 0, 0, 0.3)
r2[[1]] <- ifel(is.na(b[[1]]), NA, 0.3)
for (i in 1:nlyr(b)) {
varr1 <- b[[i]] * (((r2[[i]] - p1)/p2)^2)
r1[[i]] <- ifel(r2[[i]] > p, b[[i]], varr1)
for (k in 2:nlyr(b)) {
r2[[k]] <- min(r2[[k-1]] + (weather[k-1] - r1[[k-1]]) /100, fc)
}
}
Here is reproducible raster code:
library(raster)
b <- brick(ncols=5, nrows=5, nl=5)
inBrick <- setValues(b, runif(ncell(b) * nlayers(b)))
inBrick[c(1,2,3,22,23,24,25)] <- NA
outBrick1 <- inBrick
outBrick1[] <- NA
outBrick2 <- outBrick1
ini <- 0.3
p <- 0.15
p1 <- p/3
p2 <- p-(p/3)
fc <- 0.3
var1 <- which(!is.na(inBrick[[1]][]))
outBrick2[[1]][var1] <- ini
### now outBrick2 has initial values in 1st layer
weather <- c(0.1, 0, 0, 0, 0.3)
var3 <- 1:ncell(inBrick)
### outBrick1 Calculations
for (i in 1:nlayers(inBrick)) {
varr1 <- inBrick[[i]][]*(((outBrick2[[i]][]-p1)/(p2))^2)
for (j in 1:ncell(inBrick)) {
if(!is.na(outBrick2[[i]][j])){
if(outBrick2[[i]][j]>p){
outBrick1[[i]][j] <- inBrick[[i]][j]
}else{
outBrick1[[i]][j] <- varr1[j]
}
}
}
###outBrick2 Calculations
for (k in 2:nlayers(inBrick)) {
var2 <- outBrick2[[k-1]][] + (weather[k-1]-outBrick1[[k-1]][])/100
for(l in 1:ncell(inBrick)){
var3[l] <- min(fc, var2[l])
}
outBrick2[[k]][] <- var3
}
}
"terra" is generally faster, sometimes much faster. nukubiho points out that with arithmetic computations "raster" may be faster than "terra". However, the difference is generally small, and may only be true when the cell values are in memory.
However, this may not hold for large datasets (where these differences may actually matter). The cell values for these datasets are typically in a file.
In memory:
library("raster")
library("terra")
n = 12000
x_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
y_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
x_raster = raster(x_terra)
y_raster = raster(y_terra)
r <- c(x_terra, y_terra)
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
# user system elapsed
# 1.83 0.36 2.19
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
# user system elapsed
# 2.66 2.25 4.91
system.time(app(r, \(x) (x[,1]-x[,2]) / (x[,1] + x[,2])))
# user system elapsed
# 4.06 2.17 6.25
"raster" is faster. But when the values are in files, "terra" is faster.
x_terra = writeRaster(x_terra, "test1.tif", overwrite=T)
y_terra = writeRaster(y_terra, "test2.tif", overwrite=T)
x_raster = raster(x_terra)
y_raster = raster(y_terra)
r <- c(x_terra, y_terra)
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
# user system elapsed
# 17.25 5.39 22.66
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
# user system elapsed
# 13.63 4.92 18.61
system.time(app(r, \(x) (x[,1]-x[,2]) / (x[,1] + x[,2])))
# user system elapsed
# 9.78 3.46 13.24
Generally, terra is faster than raster, but one exception is in-memory arithmetic calculations. Probably the reason is that raster uses heavily optimized R code. See this simple example:
library("raster")
library("terra")
n = 12000
x_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
y_terra = rast(nrows = n, ncols = n, vals = rnorm(n ^ 2))
x_raster = raster(x_terra)
y_raster = raster(y_terra)
## terra
system.time({ (x_terra - y_terra) / (x_terra + y_terra) })
#> user system elapsed
#> 2.63 2.22 4.86
## raster
system.time({ (x_raster - y_raster) / (x_raster + y_raster) })
#> user system elapsed
#> 1.78 0.26 2.05
If you want to use more CPU and RAM, then maybe you should split data into blocks and process them in parallel? See these questions:
terra package returns error when try to run parallel operations
Process sets of rasters in parallel using lapp function from terra package
I just noticed that tapply() and reshape2::acast() are super slow and memory consuming in scenario when grouping by two variables!
See this example:
#download data and functions for monitoring time & memory
download.file("http://artax.karlin.mff.cuni.cz/~ttel5535/pub/so/tapply,reshape2_slow/tapply,reshape_slow.Rdata", "tapply,reshape_slow.Rdata", mode="wb")
load(file = "tapply,reshape_slow.Rdata")
require(reshape2)
mstart()
xx <- acast(bb, fi ~ gi, sum, value.var = "hour")
mstop()
# user system elapsed
# 6.58 0.79 7.90
#max memory used: 911.2Mb.
Surprisingly very slow and memory greedy! Just to show properties of the data:
nrow(bb)
#[1] 9467
dim(xx)
#[1] 4850 1492
print(object.size(xx), units = "Mb")
#28 Mb
Now tapply():
mstart()
xx2 <- tapply(bb$hour, list(bb$fi, bb$gi), sum, default = 0)
mstop()
# user system elapsed
# 6.45 2.36 9.44
#max memory used: 1135.9Mb.
Even slower and more memory greedy!
Now, to compare, solution when grouping is done by SQLite and acast() is used only for the reshaping:
require(sqldf)
mstart()
xx3_0 <- sqldf("select fi, gi, sum(hour) as sum from bb group by fi, gi")
xx3 <- acast(xx3_0, fi ~ gi, fill = 0, value.var = "sum")
mstop()
# user system elapsed
# 0.22 0.05 0.28
#max memory used: 174.1Mb.
Normally, I'm using sqldf for almost every data operations, but now I wanted to make it "easier" by using basic functions :-) But now I am really surprised that these functions perform really bad! Didn't anyone notice yet? Or am I using them wrong?
I have a simple task to do. I have a 3D array (10,1350,1280) and I want to calculate the min over the first dimensions. I can do it using aaply like the following
minObs <- plyr::aaply(obs, c(2,3), min) # min of observation
But it is extremely slow compared to when I just write a nested loop.
minObs<-matrix(nrow=dim(obs)[2],ncol=dim(obs)[3])
for (i in 1:dim(obs)[2]){
for (j in 1:dim(obs)[3]){
minObs[i,j]<-min(obs[,i,j],na.rm = TRUE)
}
}
I am new to R , but I am guessing that I am doing something wrong with aaply function. And hint would be very much appreciated. How can I speed up using aaply?
Why not just use the base apply function?
apply(obs, c(2,3), min)
It's fast, doesn't require loading an additional package and gives the same result, as per:
all.equal(
apply(obs, 2:3, min),
aaply(obs, 2:3, min), check.attributes=FALSE)
#[1] TRUE
Timings using system.time() using a 10 x 1350 x 1280 array:
Loop
# user system elapsed
# 3.79 0.00 3.79
Base apply()
# user system elapsed
# 2.87 0.02 2.89
plyr::aaply()
#Timing stopped at: 122.1 0.04 122.24
I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 on the testing set; I want to cluster on the training set and assign each test point to the correct cluster.
I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?
You could use the flexclust package, which has an implemented predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
Something I noticed about both the approach in the question and the flexclust approaches are that they are rather slow (benchmarked here for a training and testing set with 1 million observations with 2 features each).
Fitting the original model is reasonably fast:
set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
# user system elapsed
# 1.204 0.077 1.295
The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:
system.time(pred.test <- apply(df2, 1, closest.cluster))
# user system elapsed
# 42.064 0.251 42.586
Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster)
# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
# user system elapsed
# 87.562 1.216 89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
# user system elapsed
# 0.182 0.065 0.250
# Approach #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
# user system elapsed
# 125.193 7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
# user system elapsed
# 0.198 0.084 0.302
It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:
library(FNN)
system.time(pred.knn <- get.knnx(km$center, df2, 1)$nn.index[,1])
# user system elapsed
# 0.315 0.013 0.345
all(pred.test == pred.knn)
# [1] TRUE
You can use the ClusterR::KMeans_rcpp() function, use RcppArmadillo. It allows for multiple initializations (which can be parallelized if Openmp is available). Besides optimal_init, quantile_init, random and kmeans ++ initilizations one can specify the centroids using the CENTROIDS parameter. The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters.
library(scorecard)
library(ClusterR)
library(dplyr)
library(ggplot2)
## Generate data
set.seed(2019)
x = c(rnorm(200000, 0,1), rnorm(150000, 5,1), rnorm(150000,-5,1))
y = c(rnorm(200000,-1,1), rnorm(150000, 6,1), rnorm(150000, 6,1))
df <- split_df(data.frame(x,y), ratio = 0.5, seed = 123)
system.time(
kmrcpp <- KMeans_rcpp(df$train, clusters = 3, num_init = 4, max_iters = 100, initializer = 'kmeans++'))
# user system elapsed
# 0.64 0.05 0.82
system.time(pr <- predict_KMeans(df$test, kmrcpp$centroids))
# user system elapsed
# 0.01 0.00 0.02
p1 <- df$train %>% mutate(cluster = as.factor(kmrcpp$clusters)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("train data")
p2 <- df$test %>% mutate(cluster = as.factor(pr)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("test data")
gridExtra::grid.arrange(p1,p2,ncol = 2)
I regularly find rolling things of time series (particularly means), and was surprised to find that rollmean is notably faster than rollapply, and that the align = 'right' methods are faster than the rollmeanr wrappers.
How have they achieved this speed up? And why does one lose some of it when using the rollmeanr() wrapper?
Some background: I had been using rollapplyr(x, n, function(X) mean(X)), however I recently happened upon a few examples using rollmean. The documents suggest rollapplyr(x, n, mean) (note without the function part of the argument) uses rollmean so I didn't think that there would be much difference in performance, however rbenchmark revealed notable differences.
require(zoo)
require(rbenchmark)
x <- rnorm(1e4)
r1 <- function() rollapplyr(x, 3, mean) # uses rollmean
r2 <- function() rollapplyr(x, 3, function(x) mean(x))
r3 <- function() rollmean(x, 3, na.pad = TRUE, align = 'right')
r4 <- function() rollmeanr(x, 3, align = "right")
bb <- benchmark(r1(), r2(), r3(), r4(),
columns = c('test', 'elapsed', 'relative'),
replications = 100,
order = 'elapsed')
print(bb)
I was surprised to find that rollmean(x, n, align = 'right') was notably faster -- and ~40x faster than my rollapply(x, n, function(X) mean(X)) approach.
test elapsed relative
3 r3() 0.74 1.000
4 r4() 0.86 1.162
1 r1() 0.98 1.324
2 r2() 27.53 37.203
The difference seems to get larger as the size of the data-set grows. I changed only the size of x (to rnorm(1e5)) in the above code and re-ran the test and there was an even larger difference between the functions.
test elapsed relative
3 r3() 13.33 1.000
4 r4() 17.43 1.308
1 r1() 19.83 1.488
2 r2() 279.47 20.965
and for x <- rnorm(1e6)
test elapsed relative
3 r3() 44.23 1.000
4 r4() 54.30 1.228
1 r1() 65.30 1.476
2 r2() 2473.35 55.920
How have they done this? Also, is this the optimal solution? Sure, this is fast but is there an even faster way to do this?
(Note: in general my time series are almost always xts objects -- does this matter?)
Computing the rolling mean is faster than computing a general rolling function, because the first one is easier to compute. When computing a general rolling function you have to compute the function on each window again and again, which you don't have to do for mean, because of the simple identity:
(a2 + a3 + ... + an)/(n-1) = (a1 + a2 + ... + a(n-1))/(n-1) + (an - a1)/(n-1)
and you can see how that's leveraged by looking at getAnywhere(rollmean.zoo).
If you want an even faster rolling mean, use runmean from caTools, which is implemented in C making it much faster (it also scales a lot better so will get even faster as the size of data increases).
library(microbenchmark)
library(caTools)
library(zoo)
x = rnorm(1e4)
microbenchmark(runmean(x, 3, endrule = 'trim', align = 'right'),
rollmean(x, 3, align = 'right'))
#Unit: microseconds
# expr min lq median uq max neval
# runmean(x, 3, endrule = "trim", align = "right") 631.061 740.0775 847.5915 1020.048 1652.109 100
# rollmean(x, 3, align = "right") 7308.947 9155.7155 10627.0210 12760.439 16919.092 100