How to find longest sequence of values above threshold efficiently in R

I am working with spatiotemporal observations of temperature, stored in arrays of size 100*100*504 (a 100*100 grid over 504 hours, i.e. 21 days). I compute various indicators from these observations for different periods (3 to 21 days), which takes some time, and I am looking to improve computation efficiency. I am not very familiar with R, so I am not sure whether what I am doing is the most efficient approach.
One of the things I want to do is to find, for each cell, the longest continuous period of time where the temperature is above a certain threshold. This is what I'm doing at the moment:
First I compute a boolean array based on the threshold using the following function.
utci_test = array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
to_hs = function(utci, period = 1:length(utci[1, 1, ]), hs_threshold){
  utci_hs = utci * 0
  utci_hs[which(utci > hs_threshold)] = 1
  utci_hs[is.na(utci)] = 0
  return(utci_hs)
}
Then I transform each vector of hourly values for each cell into an rle object and return the maximum length of the runs of 1's (each run representing a continuous period above the threshold).
max_duration_hs = function(utci_hs, period = 1:length(utci_hs[1, 1, ])){
  apply(utci_hs, MARGIN = c(1, 2), FUN = function(x){
    r = rle(x)
    # max() has no 'fill' argument, so fill = 0 simply adds 0 to the values
    # being compared; the result is therefore 0 when there is no run of 1's
    max(r$lengths[as.logical(r$values)], fill = 0)
  })
}
Looking at the time required, I noticed that the second step takes most of the time (bear in mind that I have to repeat this operation ~8,000 times in total):
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.051   0.004   0.055
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.053   0.000   0.052
utci_test_sh = to_hs(utci_test, hs_threshold=32.0)
system.time(max_duration_hs(utci_test_sh))
#    user  system elapsed
#   0.456   0.012   0.468
So I'm wondering if there is a more efficient way to do this, as I guess the transformation into an rle object might be inefficient?

You can get a bit of a speed bump by writing your own version of rle() that exploits the fact that you only care about runs of 1's, and therefore does less comparison work. This gets you about 2x faster, down to a median time of about 250 milliseconds on my machine (a generic MacBook).
If you have to do this ~8,000 times, you'll save yourself the most time by parallelizing the code to run on a multicore machine, which is straightforward to do in R (check out e.g. the parallel package; a sketch follows the benchmark below).
Below is the code for the speedup.
# generate data
set.seed(123)
utci_test <- array(runif(100*100*504, min = 18, max = 42), c(100, 100, 504))

# original functions
to_hs = function(utci, period = 1:length(utci[1, 1, ]), hs_threshold){
  utci_hs = utci * 0
  utci_hs[which(utci > hs_threshold)] = 1
  utci_hs[is.na(utci)] = 0
  return(utci_hs)
}

max_duration_hs = function(utci_hs, period = 1:length(utci_hs[1, 1, ])){
  apply(utci_hs, MARGIN = c(1, 2), FUN = function(x){
    r = rle(x)
    max(r$lengths[as.logical(r$values)], fill = 0)
  })
}

# helper replacing rle(): the positions of the 0's (plus sentinels at both
# ends) delimit the runs of 1's, so the largest gap between consecutive
# 0-positions, minus 1, is the longest run of 1's
rle_max <- function(v) {
  max(diff(c(0L, which(v == 0), length(v) + 1))) - 1
}

max_dur_hs_2 <- function(utci_hs) {
  apply(utci_hs, MARGIN = c(1, 2), FUN = rle_max)
}
# Check equivalence
utci_hs <- to_hs(utci = utci_test, hs_threshold = 32)
all.equal(max_dur_hs_2(utci_hs),
          max_duration_hs(utci_hs))
#> [1] TRUE

# Test speed
library(microbenchmark)
microbenchmark(max_dur_hs_2(utci_hs),
               max_duration_hs(utci_hs))
#> Unit: milliseconds
#>                      expr      min       lq     mean   median       uq      max neval cld
#>     max_dur_hs_2(utci_hs) 216.1481 236.7825 250.9277 247.9918 262.4369 296.0146   100  a
#>  max_duration_hs(utci_hs) 454.5740 476.5710 501.5119 489.9536 509.8750 774.9963   100   b

Created on 2020-05-07 by the reprex package (v0.3.0)
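As a rough sketch of the parallelization suggestion above (my addition, not part of the benchmarked reprex): assuming the ~8,000 repetitions correspond to separate arrays kept in a list, here called utci_list (a hypothetical name), the work could be spread over cores with the parallel package along these lines.

library(parallel)

# one repetition: threshold the array, then take the per-cell longest run
run_one <- function(utci) {
  utci_hs <- to_hs(utci, hs_threshold = 32.0)
  max_dur_hs_2(utci_hs)
}

n_cores <- max(1L, detectCores() - 1L)

# mclapply() forks, so it works on Linux/macOS; on Windows, create a PSOCK
# cluster with makeCluster(), export to_hs/rle_max/max_dur_hs_2 via
# clusterExport(), and use parLapply() instead
results <- mclapply(utci_list, run_one, mc.cores = n_cores)  # utci_list: your list of arrays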

Related

distance matrix on huge dataset in R

I have a huge dataset (1,382,400 data points). I have to compute a distance matrix to run HAC (hierarchical agglomerative clustering) on it, but when I do
dist(dataset)
I get the error:
cannot allocate vector of size 197.8 Gb
even though my data are only 1.1 MB in size.
What can I do to make it work?
Thank you!
You can use the fastcluster library. Its hclust.vector function uses a memory-efficient algorithm and avoids the call to dist(). Be patient, though: it took me about 76 ms to cluster 2,304 observations with 6 columns (the sample below, 1/100 of your dataset). With 100 times more observations it will take roughly 10,000 times longer, at least, since the algorithm still scales quadratically in the number of observations. Unfortunately there is no magic here: saving space costs time (see the comment by SamR above).
library(microbenchmark)
library(fastcluster)

# Generate a sample with ~14 thousand elements (2,304 rows x 6 columns),
# i.e. 1/100 of the full dataset; clustering it takes ~80 ms
elements = 13824L
cols = 6L
x <- matrix(rnorm(elements), ncol = cols, nrow = elements / cols)
microbenchmark(fastcluster::hclust.vector(x))
# Unit: milliseconds
#                          expr     min       lq     mean  median      uq     max neval
# fastcluster::hclust.vector(x) 73.3258 75.14525 76.12058 75.8676 76.9658 80.7951   100
# Generate a sample with ~1.4 million elements (230,400 rows x 6 columns)
elements = 1382400L
cols = 6L
x <- matrix(rnorm(elements), ncol = cols, nrow = elements / cols)

# Built-in clustering
stats::hclust(dist(x))
# Error: cannot allocate vector of size 197.8 Gb

# Using the fastcluster library
fastcluster::hclust.vector(x)
# Take care, this may take more than an hour
# I was not that patient :)
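As a small follow-up sketch (my addition, not from the answer): hclust.vector() returns a standard "hclust" object, so once it eventually finishes you can cut the tree into groups with cutree() just as you would after stats::hclust().

hc <- fastcluster::hclust.vector(x)   # default method is single linkage
groups <- cutree(hc, k = 10)          # k = 10 is an arbitrary example
table(groups)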

Moving Maximum in last 5 minutes in R

I was wondering how to implement a moving maximum and minimum for a price in the last 5 minutes in O(n) time in R. My data consists of two columns: one with the time of day in seconds and the other with price. Right now, I take the current time, subtract 5 minutes, subset for the last 5 minutes, and then search for min and max at each index, so the operation is O(n^2). Is there any way to do this in O(n)?
Sample data:
time
[34200.19, 34200.23, 34201.45, ..., 35800, 35800.2, 35800.5]
price
[100, 103, 102, ..., 95, 97, 99]
The following compares a direct approach with a slightly more efficient variant, which looks to scale as roughly n^1.6 on the values I've tried (10,000 to 100,000). This partly depends on whether increasing n is assumed to mean more points in the same time period or extending over a longer period.
# Create some data
n <- 10000
d <- data.frame(t = as.POSIXct(24*3600*runif(n), origin = "2014-01-01"), x = runif(n))
d <- d[order(d$t), ]
d$inmax2 <- d$inmax <- rep(FALSE, n)
d$inmax2[1] <- d$inmax[1] <- TRUE
if (max(diff(d$t)) > 300) warning("There are gaps of more than 300 secs")

# Method 1: assume that you've done something like this
t1 <- system.time({
  for (i in 2:n) d$inmax[i] <- !any((difftime(d$t[i], d$t[1:(i-1)], units = "secs") < 300) & (d$x[i] < d$x[1:(i-1)]))
})

# Method 2: only keep the candidates that can still be a running maximum
t2 <- system.time({
  cand <- 1
  next_cand <- 2
  while (next_cand <= n)
  {
    cand <- cand[difftime(d$t[next_cand], d$t[cand], units = "secs") < 300]
    cand <- c(cand[d$x[cand] > d$x[next_cand]], next_cand)
    if (length(cand) == 1) d$inmax2[cand] <- TRUE
    next_cand <- next_cand + 1
  }
})

rbind(method1 = t1, method2 = t2)
#         user.self sys.self elapsed user.child sys.child
# method1     14.98     0.03   15.04         NA        NA
# method2      2.59     0.05    2.63         NA        NA

all(d[[3]] == d[[4]])
# TRUE
The approach is to run through the data, keeping all candidates from the past 5 minutes that are not less than the present value. If there are no such candidates, the current value must be the maximum. I assume you can generalise this to the minimum.
It possibly doesn't work if you want to know the maximum over the last 5 minutes between data points rather than at data points, though; I'm not sure whether you require that.
Sort the data frame by time first. Then maintain a max-heap of the prices, removing the entries that have dropped out of the 5-minute window after every shift. Since rebalancing a heap is O(log n), this gives O(n log n) overall. For implementing a max-heap, consult any algorithms textbook (although I may edit this post later with one); a rough sketch follows below.
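For illustration, here is a rough base R sketch of that heap idea (my addition, not the answerer's code), assuming the data are already sorted by time and that time is given in seconds as in the question. The heap is a two-column (price, time) matrix ordered as a max-heap on price, and stale entries are discarded lazily when they surface at the top.

# push a (price, time) pair onto the max-heap, keyed on price
heap_push <- function(h, price, time) {
  h <- rbind(h, c(price, time))
  i <- nrow(h)
  while (i > 1) {                          # sift the new entry up
    parent <- i %/% 2
    if (h[parent, 1] >= h[i, 1]) break
    h[c(parent, i), ] <- h[c(i, parent), ]
    i <- parent
  }
  h
}

# remove the root (current maximum) and restore the heap property
heap_pop <- function(h) {
  n <- nrow(h)
  h[1, ] <- h[n, ]
  h <- h[-n, , drop = FALSE]
  i <- 1
  n <- nrow(h)
  repeat {                                 # sift the moved entry down
    l <- 2 * i; r <- 2 * i + 1; big <- i
    if (l <= n && h[l, 1] > h[big, 1]) big <- l
    if (r <= n && h[r, 1] > h[big, 1]) big <- r
    if (big == i) break
    h[c(i, big), ] <- h[c(big, i), ]
    i <- big
  }
  h
}

# rolling maximum of price over the previous `window` seconds, per data point
rolling_max <- function(time, price, window = 300) {
  h <- matrix(numeric(0), ncol = 2)        # columns: price, time
  out <- numeric(length(price))
  for (i in seq_along(price)) {
    h <- heap_push(h, price[i], time[i])
    # lazily evict maxima that have fallen out of the time window
    while (h[1, 2] < time[i] - window) h <- heap_pop(h)
    out[i] <- h[1, 1]
  }
  out
}

For example, rolling_max(time, price) on the (sorted) vectors from the question returns the 5-minute maximum at each data point; each observation is pushed once and popped at most once, so the whole pass is O(n log n). The rolling minimum is analogous with the comparisons reversed.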

Convert Date to year month representation

I have a Date, and am interested in representing it as an integer of yyyymm form. Currently, I do:
get_year_month <- function(d) { return(as.integer(format(d, "%Y%m")))}
mydate = seq.Date(from = as.Date("2012-01-01"), to = as.Date("5012-01-01"), by = 1)
system.time(ym <- get_year_month(mydate))
# user system elapsed
# 5.972 0.974 6.951
This is very slow for large datasets. Is there a faster way? Please provide timings for your answers so they can be easily compared. Use the above example.
Using functions from the lubridate package can be almost twice as fast as your function:
mydate = as.Date(rep("2012-01-01", 1000))
library(lubridate)
library(microbenchmark)
microbenchmark(get_year_month(mydate),
               year(mydate)*100 + month(mydate))
which gives:
R> Unit: milliseconds
                                 expr      min       lq   median       uq
               get_year_month(mydate) 2.150296 2.188370 2.218176 2.285973
   year(mydate) * 100 + month(mydate) 1.220016 1.228129 1.239704 1.284568
You can try the yearmon class from the zoo package. In general, if you are doing time series manipulation and analysis, I would suggest using xts or at least zoo objects; xts has a lot of functionality for analysing very large time series.
Here is a quick benchmark against the other suggested solutions.
library(zoo)
library(lubridate)
library(microbenchmark)

get_year_month <- function(d) {
  return(as.integer(format(d, "%Y%m")))
}

mydate = as.Date(rep("2012-01-01", 1e+06))
microbenchmark(get_year_month(mydate),
               year(mydate) * 100 + month(mydate),
               as.yearmon(mydate, format = "%Y-%m-%d"),
               times = 1)
## Unit: milliseconds
##                                     expr       min        lq    median        uq       max neval
##                   get_year_month(mydate) 1049.8813 1049.8813 1049.8813 1049.8813 1049.8813     1
##       year(mydate) * 100 + month(mydate)  434.1765  434.1765  434.1765  434.1765  434.1765     1
##  as.yearmon(mydate, format = "%Y-%m-%d")  249.6704  249.6704  249.6704  249.6704  249.6704     1
It would be best to keep your Dates in POSIXlt format if you want to manipulate them like that:
> system.time(ym <- get_year_month(mydate))
user system elapsed
4.039 0.025 4.079
> system.time(mydatep <- as.POSIXlt(mydate))
user system elapsed
3.576 0.016 3.603
> system.time(ym <- (1900 + mydatep$year)*100 + (mydatep$mon + 1))
user system elapsed
0.010 0.005 0.015
It's still a little faster, and you get subsequent similar operations for free, in terms of time.
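For example (my illustration, not the answerer's), other date components then come out of the same POSIXlt object at essentially no extra conversion cost:

day_of_month <- mydatep$mday        # day of the month, 1-31
day_of_year  <- mydatep$yday + 1    # $yday is 0-based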
There may not be a faster way for a single item. However, you can write a version of the function that operates on a whole collection at once, e.g. using the built-in replicate:
mydate <- function(D) {
  x <- replicate(length(D), get_year_month(..))
  return(x)
}

R - streamline subsampling procedure

I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!
library(plyr)

subsample = function(x, A) {
  subsample.list = sapply(1:A, function(m) {
    # Sample one experiment from each study
    idx = ddply(x, c("study"), function(i) sample(1:nrow(i), 1))
    # Match the study-experiment combinations and retrieve the values
    x[paste(x$study, x$expt, sep = "-") %in% paste(idx$study, idx$V1, sep = "-"), "value"]
  })
  # Calculate the mean of the values for each iteration
  means.list = ldply(subsample.list, mean)
  # Calculate the overall mean and 95% CI
  c(quantile(means.list$V1, 0.025), mean(means.list$V1), upper = quantile(means.list$V1, 0.975))
}
You can vectorise this much more (even staying with plyr), and go much, much faster:
yoursummary = function(x) c(quantile(x, 0.025), mean(x), upper = quantile(x, 0.975))

subsampleX = function(x, M)
  yoursummary(
    aaply(
      daply(.drop_o = F, x, .(study),
            function(d) sample(d$value, M, replace = T)
      ), 1, mean
    )
  )
The trick here is to do all the sampling up front: if we want to sample M times, why not draw all M samples while you have access to each study's rows?
Original code:
> system.time(subsample(df,20000))
user system elapsed
123.23 0.06 124.74
New vectorised code:
> system.time(subsampleX(df,20000))
user system elapsed
0.24 0.00 0.25
That's about 500x faster.
Here's a base R solution which avoids ddply for speed reasons:
df = data.frame(study = c(1,1,2,2,2,3,4,4), expt = c(1,2,1,2,3,1,1,2), value = runif(8))

# assumes the rows are ordered by study, so that rle() identifies each study's block
sample.experiments <- function(df) {
  r <- rle(df$study)
  samp <- sapply(r$lengths, function(x) sample(seq(x), 1))
  start.idx <- c(0, cumsum(r$lengths)[1:(length(r$lengths) - 1)])
  df[samp + start.idx, ]
}

> sample.experiments(df)
  study expt     value
1     1    1 0.6113196
4     2    2 0.5026527
6     3    1 0.2803080
7     4    1 0.9824377
Benchmarks
> m <- microbenchmark(
+   ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]),
+   sample.experiments(df)
+ )
> m
Unit: microseconds
                                                        expr      min       lq   median       uq      max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2                                     sample.experiments(df)  337.327  350.734  357.644  365.915  580.097
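If it helps, here is a minimal sketch (my addition, not part of the answer) of plugging sample.experiments() into the full 10,000-draw procedure from the question and summarising the iteration means:

set.seed(42)
means <- replicate(10000, mean(sample.experiments(df)$value))
c(lower = quantile(means, 0.025), mean = mean(means), upper = quantile(means, 0.975))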

Partition into classes: jenks vs kmeans

I want to partition a vector (length around 10^5) into five classes. With the classIntervals function from the classInt package I wanted to use style = "jenks" (natural breaks), but this takes an inordinate amount of time even for a much smaller vector of only 500 values. Setting style = "kmeans" executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

system.time(classIntervals(x, n = 5, style = "jenks"))
#    user  system elapsed
#   13.46    0.00   13.45

system.time(classIntervals(x, n = 5, style = "kmeans"))
#    user  system elapsed
#    0.02    0.00    0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be I will move the last two parts of the question to stats.stackexchange.com:
Under what circumstances is kmeans a reasonable substitute for Jenks?
Is it reasonable to define classes by running classInt on a random 1% subset of the data points?
To answer your original question:
What makes the Jenks algorithm so slow, and is there a faster way to run it?
Indeed, there is meanwhile a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.
However, be aware that you have to set the number of breaks differently: if you set the breaks to 5 in the classIntervals function of the classInt package, you have to set the breaks to 6 in the getJenksBreaks function of the BAMMtools package to get the same results.
# Install and load the library
install.packages("BAMMtools")
library(BAMMtools)

# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

# Apply the function
getJenksBreaks(x, 6)
The speed-up is huge:
> microbenchmark(getJenksBreaks(x, 6, subset = NULL),
+                classIntervals(x, n = 5, style = "jenks"),
+                unit = "s", times = 10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10  a
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10
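Relating to the question about classifying on a random subset of the data: getJenksBreaks() also has a subset argument (passed as subset = NULL in the benchmark above), which, as far as I understand the documentation, computes the breaks from a regularly spaced subsample of the vector, trading some accuracy for further speed on very long vectors, e.g.

getJenksBreaks(x, 6, subset = 1000)   # breaks estimated from ~1,000 sampled values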
From ?BAMMtools::getJenksBreaks
The Jenks natural breaks method was ported to C from code found in the classInt R package.
The two implementations compute the same breaks; one is faster simply because it is written in C rather than R.
