Partition into classes: jenks vs kmeans - r

I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from package classInt I wanted to use style = "jenks" natural breaks but this takes an inordinate amount of time even for a much smaller vector of only 500. Setting style = "kmeans" executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
R> system.time(classIntervals(x, n = 5, style = "jenks"))
   user  system elapsed
  13.46    0.00   13.45
R> system.time(classIntervals(x, n = 5, style = "kmeans"))
   user  system elapsed
   0.02    0.00    0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be I will move the last two parts of the question to stats.stackexchange.com:
Under what circumstances is kmeans a reasonable substitute for Jenks?
Is it reasonable to define classes by running classInt on a random 1% subset of the data points?
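As context for that last sub-question, here is a minimal sketch (my illustration, not code from the original post) of what classifying via a 1% random subset might look like; big_x is a hypothetical stand-in for the real 10^5-length vector:
# Illustration only: compute Jenks breaks on a 1% random sample, then
# classify the full vector with those breaks.
set.seed(42)
big_x <- rnorm(1e5, mean = rep((1:5) * 5, each = 2e4))   # hypothetical 10^5 vector
idx   <- sample(length(big_x), size = ceiling(0.01 * length(big_x)))
brks  <- classIntervals(big_x[idx], n = 5, style = "jenks")$brks  # still slow-ish on 1000 points
classes <- findInterval(big_x, brks, all.inside = TRUE)           # class labels 1..5
table(classes)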

To answer your original question:
What makes the Jenks algorithm so slow, and is there a faster way to run it?
Indeed, there is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.
However, be aware that you have to specify the number of breaks differently: if you set n = 5 in the classIntervals function of the classInt package, you have to ask the getJenksBreaks function in the BAMMtools package for 6 breaks to get the same result, because getJenksBreaks expects the number of break points (class boundaries) rather than the number of classes.
# Install and load library
install.packages("BAMMtools")
library(BAMMtools)
# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
# Apply function
getJenksBreaks(x, 6)
The speed up is huge, i.e.
> library(microbenchmark)
> microbenchmark(getJenksBreaks(x, 6, subset = NULL), classIntervals(x, n = 5, style = "jenks"), unit = "s", times = 10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10   a
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10

From ?BAMMtools::getJenksBreaks
The Jenks natural breaks method was ported to C from code found in the classInt R package.
The two implementations produce the same result; the BAMMtools version is faster simply because it is written in C rather than R.
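As a usage note (my sketch, not part of the quoted documentation): getJenksBreaks returns the class boundaries themselves, so class membership for each value can then be obtained with base R's cut(), for example:
brks <- getJenksBreaks(x, 6)   # 6 boundaries define 5 classes
classes <- cut(x, breaks = brks, include.lowest = TRUE, labels = FALSE)
table(classes)                 # number of values in each of the 5 classes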

Related

distance matrix on huge dataset in R

I have a huge dataset containing 1,382,400 values. I have to compute a distance matrix in order to run hierarchical agglomerative clustering (HAC) on it, but when I do
dist(dataset)
I get the error:
cannot allocate vector of size 197.8 Gb
even though my data is only 1.1 MB.
What can I do to make it work?
Thank you!
You can use the fastcluster library. Its hclust.vector function uses a memory-efficient algorithm and avoids calling dist() altogether. Be patient, though: clustering the 1% sample below (2,304 observations, 6 columns) took about 76 ms, and because the run time grows roughly quadratically with the number of observations, 100 times more observations will take roughly 10,000 times longer, at least. Unfortunately there is no magic here: what you save in memory you pay for in computation time.
library(microbenchmark)
library(fastcluster)
# Generate a 1% sample: 13,824 values = 2,304 observations x 6 columns
# clustering it takes ~ 80 ms (see microbenchmark below)
elements = 13824L
cols = 6L
x <- matrix(rnorm(elements), ncol = cols, nrow = elements / cols)
microbenchmark(fastcluster::hclust.vector(x))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fastcluster::hclust.vector(x) 73.3258 75.14525 76.12058 75.8676 76.9658 80.7951 100
# Generate the full-size data: ~1.3 million values = 230,400 observations x 6 columns
elements = 1382400L
cols = 6L
x <- matrix(rnorm(elements), ncol = cols, nrow = elements / cols)
# Built-in clustering
stats::hclust(dist(x))
# Error: cannot allocate vector of size 197.8 Gb
# Using fastcluster library
fastcluster::hclust.vector(x)
# Take care, this may take more than an hour
# I was not that patient :)
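As a side note (my own back-of-the-envelope check, not part of the original answer), the 197.8 Gb in the error message is exactly the size of the distance matrix that dist() would need for 1,382,400 / 6 = 230,400 observations:
# dist() stores n*(n-1)/2 pairwise distances as 8-byte doubles
n <- 230400
n * (n - 1) / 2 * 8 / 2^30
# [1] 197.753   -- i.e. the ~197.8 Gb reported in the error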

How to find longest sequence of values above threshold efficiently in R

I am working with spatiotemporal observations of temperature, stored in arrays of size 100*100*504 (a 100*100 grid, for 504 different hours representing 21 days). I compute various indicators from these observations, for different periods (3 to 21 days), which obviously takes some time, so I am looking to improve computational efficiency. I am not really accustomed to R, so I am not sure whether what I am doing is the most efficient way.
One of the things I want to do is to find, for each cell, the longest continuous period of time during which the temperature is above a certain threshold. This is what I'm doing at the moment:
First I compute a boolean array based on the threshold using the following function.
utci_test = array(runif(100*100*504, min = 18, max = 42), c(100,100,504))

to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
  utci_hs = utci*0
  utci_hs[which(utci > hs_threshold)] = 1
  utci_hs[is.na(utci)] = 0
  return(utci_hs)
}
Then I transform the vector of hourly values for each cell into an rle object, and return the maximum length of the sequences of 1's (representing a continuous period over the threshold).
max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
  apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
    r = rle(x)
    # fill = 0 is swallowed by max()'s ... argument and simply acts as an extra
    # candidate value, so cells with no run above the threshold return 0
    max(r$lengths[as.logical(r$values)], fill = 0)
  })
}
Looking at the time required, I noticed that the second step is the one taking time (bear in mind that I have to repeat this operation ~8000 times in total):
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.051   0.004   0.055
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.053   0.000   0.052
utci_test_sh = to_hs(utci_test, hs_threshold=32.0)
system.time(max_duration_hs(utci_test_sh))
#    user  system elapsed
#   0.456   0.012   0.468
So I'm wondering whether there is a more efficient way to do this, as I suspect transforming each vector into an rle object might be inefficient?
You can get a bit of a speed bump by writing your own version of the rle() step: because you know you only care about runs of 1's, it can do a little less comparison work. This gets you about 2x faster, down to a median time of about 250 milliseconds or so on my machine (a generic MacBook).
If you have to do this 8,000 times you'll save yourself the most time by parallelising the code to run on a multicore machine, which is straightforward to do in R (check out e.g. the parallel package; a minimal sketch is included after the benchmark below).
Below is the code for the speedup.
# generate data
set.seed(123)
utci_test <- array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
# original functions
to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
  utci_hs = utci*0
  utci_hs[which(utci > hs_threshold)] = 1
  utci_hs[is.na(utci)] = 0
  return(utci_hs)
}

max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
  apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
    r = rle(x)
    max(r$lengths[as.logical(r$values)], fill = 0)
  })
}

# helper func replacing rle(): the longest run of 1's is one less than the
# largest gap between consecutive 0's (with sentinels at both ends)
rle_max <- function(v) {
  max(diff(c(0L, which(v==0), length(v)+1))) - 1
}

max_dur_hs_2 <- function(utci_hs) {
  apply(utci_hs, MARGIN=c(1,2), FUN= rle_max)
}
# Check equivalence
utci_hs <- to_hs(utci = utci_test, hs_threshold = 32)
all.equal(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> [1] TRUE
# Test speed
library(microbenchmark)
microbenchmark(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> Unit: milliseconds
#>                      expr      min       lq     mean   median       uq      max neval cld
#>     max_dur_hs_2(utci_hs) 216.1481 236.7825 250.9277 247.9918 262.4369 296.0146   100  a
#>  max_duration_hs(utci_hs) 454.5740 476.5710 501.5119 489.9536 509.8750 774.9963   100   b
Created on 2020-05-07 by the reprex package (v0.3.0)
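Regarding the parallelisation suggestion above, here is a minimal sketch using the base parallel package; the process_one() helper and utci_list are hypothetical placeholders for however the ~8000 repetitions are organised:
library(parallel)

# hypothetical wrapper: threshold + longest-run computation for one array
process_one <- function(utci_array, threshold = 32) {
  max_dur_hs_2(to_hs(utci_array, hs_threshold = threshold))
}

utci_list <- list(utci_test, utci_test)   # stand-in for the real list of arrays

n_cores  <- max(1, detectCores() - 1)
# mclapply() forks on Linux/macOS; on Windows use makeCluster() + parLapply()
res_list <- mclapply(utci_list, process_one, mc.cores = n_cores)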

R - faster alternative to hist(XX, plot=FALSE)$count

I am on the lookout for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count, since I don't need any of the other output it produces (I want to use it in an sapply call requiring 1 million iterations), e.g.
x = runif(100000000, 2.5, 2.6)
bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
Any thoughts?
A first attempt using table and cut:
table(cut(x, breaks=seq(0,3,length.out=100)))
It avoids the extra output, but takes about 34 seconds on my computer:
system.time(table(cut(x, breaks=seq(0,3,length.out=100))))
user system elapsed
34.148 0.532 34.696
compared to 3.5 seconds for hist:
system.time(hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count)
user system elapsed
3.448 0.156 3.605
Using tabulate and .bincode runs a little bit faster than hist:
tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100))
user system elapsed
3.084 0.024 3.107
Using tabulate and findInterval provides a significant performance boost relative to table and cut, and a decent improvement relative to hist:
tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100))
user system elapsed
2.044 0.012 2.055
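Since the stated goal is to call this inside an sapply() over a million iterations, a small usage sketch (my addition, assuming the breaks stay fixed across iterations): compute the breaks once and reuse them.
breaks <- seq(0, 3, length.out = 100)

bin_counts <- function(x, breaks) {
  # values below breaks[1] get index 0, which tabulate() silently drops;
  # values at or above the last break all land in the final bin
  tabulate(findInterval(x, vec = breaks), nbins = length(breaks))
}

bin_counts(runif(1000, 2.5, 2.6), breaks)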
Seems your best bet is to just cut out all the overhead of hist.default.
nB1 <- 99
delt <- 3/nB1
fuzz <- 1e-7 * c(-delt, rep.int(delt, nB1))
breaks <- seq(0, 3, by = delt) + fuzz
.Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)
I pared down to this by running debugonce(hist.default) to get a feel for exactly how hist works (and testing with a smaller vector -- n = 100 instead of 1000000).
Comparing:
x = runif(100, 2.5, 2.6)
y1 <- .Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)  # breaks already includes fuzz
y2 <- hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
identical(y1, y2)
# [1] TRUE

trying to perform a t.test for each row and count all rows where p-value is less than 0.05

I've been wracking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer but so far I've been hitting wall after wall. I now appeal to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
  DataSample[,1:12], 1, function(x) t.test(x, DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error -- but since the code returns the wrong answer I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set, calculating the t-test for each row, though it is not especially elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0, nrow(DataSample))
for (i in 1:nrow(DataSample)){
  # the question states equal variances, so var.equal = TRUE
  pvalue[i] <- t.test(DataSample[i,1:12], DataSample[i,13:24], var.equal = TRUE)$p.value
}
# count how many rows are significant at the 0.05 level
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(DataSample.dt[x, paste0('Trial', 1:12), with=F],
         DataSample.dt[x, paste0('Control', 13:24), with=F],
         var.equal=T)$p.value) < 0.05)
To do a paired t-test, you need to supply the paired = TRUE argument. The t.test function isn't vectorised, but it's quite simple to do t-tests on a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
  manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
  manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same sizes.
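If the two groups ever have different sizes, a hedged sketch of the same vectorised idea using the general pooled-variance formula (my addition, not part of the original answer) could look like this:
library(matrixStats)

# equal-variance (pooled) t-test for each row, allowing n1 != n2
row_t_pooled <- function(x, y) {
  n1 <- ncol(x); n2 <- ncol(y)
  sp2 <- ((n1 - 1) * rowVars(x) + (n2 - 1) * rowVars(y)) / (n1 + n2 - 2)
  tstat <- (rowMeans(x) - rowMeans(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
  2 * pt(-abs(tstat), df = n1 + n2 - 2)
}

# with n1 == n2 == 12 this reduces to the 'manual' formula benchmarked above, e.g.
# all.equal(row_t_pooled(DataSample[, 1:12], DataSample[, 13:24]), ps1)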
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4

What makes rollmean faster than rollapply (code-wise)?

I regularly compute rolling functions of time series (particularly rolling means), and was surprised to find that rollmean is notably faster than rollapply, and that the align = 'right' methods are faster than the rollmeanr wrappers.
How have they achieved this speed up? And why does one lose some of it when using the rollmeanr() wrapper?
Some background: I had been using rollapplyr(x, n, function(X) mean(X)); however, I recently happened upon a few examples using rollmean. The documentation suggests that rollapplyr(x, n, mean) (i.e. passing mean directly rather than wrapping it in an anonymous function) uses rollmean, so I didn't think there would be much difference in performance, but rbenchmark revealed notable differences.
require(zoo)
require(rbenchmark)
x <- rnorm(1e4)
r1 <- function() rollapplyr(x, 3, mean) # uses rollmean
r2 <- function() rollapplyr(x, 3, function(x) mean(x))
r3 <- function() rollmean(x, 3, na.pad = TRUE, align = 'right')
r4 <- function() rollmeanr(x, 3, align = "right")
bb <- benchmark(r1(), r2(), r3(), r4(),
columns = c('test', 'elapsed', 'relative'),
replications = 100,
order = 'elapsed')
print(bb)
I was surprised to find that rollmean(x, n, align = 'right') was notably faster -- and ~40x faster than my rollapply(x, n, function(X) mean(X)) approach.
test elapsed relative
3 r3() 0.74 1.000
4 r4() 0.86 1.162
1 r1() 0.98 1.324
2 r2() 27.53 37.203
The difference seems to get larger as the size of the data-set grows. I changed only the size of x (to rnorm(1e5)) in the above code and re-ran the test and there was an even larger difference between the functions.
test elapsed relative
3 r3() 13.33 1.000
4 r4() 17.43 1.308
1 r1() 19.83 1.488
2 r2() 279.47 20.965
and for x <- rnorm(1e6)
test elapsed relative
3 r3() 44.23 1.000
4 r4() 54.30 1.228
1 r1() 65.30 1.476
2 r2() 2473.35 55.920
How have they done this? Also, is this the optimal solution? Sure, this is fast but is there an even faster way to do this?
(Note: in general my time series are almost always xts objects -- does this matter?)
Computing a rolling mean is faster than computing a general rolling function because the mean is a special case: a general rolling function has to be re-evaluated on every window from scratch, whereas each rolling mean can be obtained from the previous one thanks to the simple identity:
(a2 + a3 + ... + an)/(n-1) = (a1 + a2 + ... + a(n-1))/(n-1) + (an - a1)/(n-1)
and you can see how that's leveraged by looking at getAnywhere(rollmean.zoo).
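For intuition, here is a minimal sketch (my illustration, not zoo's actual code) of how a running sum turns the rolling mean into a single pass over the data:
# right-aligned rolling mean of window width k computed from one cumulative sum
roll_mean_cumsum <- function(x, k) {
  cs <- cumsum(x)
  (cs[k:length(x)] - c(0, cs[seq_len(length(x) - k)])) / k
}

xx <- rnorm(10)
all.equal(roll_mean_cumsum(xx, 3), zoo::rollmean(xx, 3, align = "right"))
# [1] TRUE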
If you want an even faster rolling mean, use runmean from caTools, which is implemented in C, making it much faster (it also scales better, so its advantage grows as the data gets larger).
library(microbenchmark)
library(caTools)
library(zoo)
x = rnorm(1e4)
microbenchmark(runmean(x, 3, endrule = 'trim', align = 'right'),
rollmean(x, 3, align = 'right'))
#Unit: microseconds
# expr min lq median uq max neval
# runmean(x, 3, endrule = "trim", align = "right") 631.061 740.0775 847.5915 1020.048 1652.109 100
# rollmean(x, 3, align = "right") 7308.947 9155.7155 10627.0210 12760.439 16919.092 100

Resources