I am trying to find an efficient way to set a classification threshold for a predictive model's probability scores based on a custom performance metric in R. It is worth noting that the real data is imbalanced and has 35 million+ rows in the training set. This gives roughly 35 million prediction scores, any of which could be chosen as the threshold separating the two classes. I have tried two approaches so far:
1. A 'smart', single thread approach trying to do minimal work
2. A brute-force, parallel multi-threaded approach.
Approach 1 performs a lot better (see below), but is still too slow
on the real data (I gave up after it had been running for 25+ hours). My question is whether anyone has a better approach or knows of a useful package for this. I have looked through Stack Overflow and can't find anything similar. I would think some parallel version of my first approach would be the best option, but since each iteration relies on the results of the previous one, I don't think it is easy to parallelise.
Benchmark results on small data (1,000 rows run 100 times and 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 338.5525 366.5356 387.0256 384.0934 396.6146 714.5271 100
brut_force_multi_thread(1000, 20) 6121.4523 6206.6340 6279.6554 6253.2492 6324.4614 6593.9065 100
Unit: seconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20.45089 21.31735 21.41669 21.56343 21.78985 21.96191 5
brut_force_multi_thread(50000, 20) 797.55525 797.60568 799.15903 797.73044 798.24058 804.66320 5
Code:
First, the two approaches written as functions:
#1. A 'smart', single thread approach trying to do minimal work
minimal_single_thread<-function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
comp <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score
setorder(comp,pred)
#create table to hold performance metrics
optimum_threshold <- data.table("pred"=comp$pred)
#Get the number of predictions at each unique prediction score
#necessary as two cases could have same score
optimum_threshold <- optimum_threshold[, .(count = .N), by = pred]
setorder(optimum_threshold,pred)
#Add necessary columns
optimum_threshold[,f_measure:=0.0]
optimum_threshold[,TPR:=0.0]
optimum_threshold[,f_measure_unadj:=0.0]
optimum_threshold[,mcc:=0.0]
#Get totals for correcting the values for adjusted f-measure metric
num_negatives <- nrow(comp[obs==0,])
num_positives <- nrow(comp[obs==1,])
# Loop through all possible values of the cut-off(threshold) and store the confusion matrix scores
obs<-comp$obs
#for FP, start with every case predicted as 1; predictions flip to 0 as the threshold moves up
comparison_fp_pred <- rep(1,length(obs))
comparison_fp <- (comparison_fp_pred & !obs)
#similarly for FN, start with every case predicted as 0
comparison_fn_pred <- !rep(1,length(obs))
comparison_fn <- (comparison_fn_pred & obs)
act_pos<-sum(obs)
act_neg<-num_negatives
#keep count of last position for updating comparison
lst<-0L
row_ind <- 1L
for(pred_score_i in optimum_threshold$pred){
#find out how many cases at the predicted score
changed <- optimum_threshold[row_ind,count]
#Update the cases that have changed to the opposite of what they were before
#i.e. the prediction was 1 before and is now 0, so flip the logical flags for these cases; everything else stays the same
comparison_fp_pred[(lst+1):(lst+changed)] <- !comparison_fp_pred[(lst+1):(lst+changed)]
comparison_fp[(lst+1):(lst+changed)] <- (comparison_fp_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
#need to calc logic for fn
comparison_fn_pred[(lst+1):(lst+changed)] <- !comparison_fn_pred[(lst+1):(lst+changed)]
comparison_fn[(lst+1):(lst+changed)] <- (comparison_fn_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
FP <- as.double(sum(comparison_fp))
FN <- as.double(sum(comparison_fn))
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + num_positives*TPR
TN_cor <- TN - num_positives*(1-TPR)
FP_cor <- FP - num_positives*TPR
FN_cor <- FN + num_positives*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
set(optimum_threshold,i=row_ind,j="TPR",value=TPR)
set(optimum_threshold,i=row_ind,j="f_measure_unadj",value=f1_unadj)
set(optimum_threshold,i=row_ind,j="mcc",value=MCC)
set(optimum_threshold,i=row_ind,j="f_measure",value=f1)
#update references
lst <- lst+changed
row_ind <- row_ind+1L
}
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
#2. A brute-force, parallel multi-threaded approach.
brut_force_multi_thread <-function(n,num_threads){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score - performance metrics will be held here
setorder(optimum_threshold,pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
print(paste("Number of threads used",num_threads))
cl <- makeCluster(num_threads)
registerDoParallel(cl)
cl_return <- foreach(row_ind = 1L:nrow(optimum_threshold),
.packages = c("data.table")) %dopar% {
FP <- nrow(optimum_threshold[(row_ind+1L):num_cases,][obs==0,])
FN <- sum(optimum_threshold[1L:row_ind,obs])
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + act_pos*TPR
TN_cor <- TN - act_pos*(1-TPR)
FP_cor <- FP - act_pos*TPR
FN_cor <- FN + act_pos*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
loop_dt <- data.table("pred"=optimum_threshold[row_ind,pred],"f_measure"=f1,
"TPR"=TPR,"f_measure_unadj"=f1_unadj,"mcc"=MCC)
return(loop_dt)
}
#stop cluster
stopCluster(cl)
#Combine all - Get unique values
optimum_threshold<-unique(rbindlist(cl_return))
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
Next the comparison to ensure the same results are obtained from the two approaches:
library(data.table)
library(parallel)
library(doParallel)
library(foreach)
minimal_single_thread_return <- minimal_single_thread(100)
brut_force_multi_thread_return <- brut_force_multi_thread(100,5)
print(brut_force_multi_thread_return)
$threshold
[1] 0.008086668
print(minimal_single_thread_return)
$threshold
[1] 0.008086668
Lastly, benchmarking on datasets of 1,000 rows (run 100 times) and 50,000 rows (run 5 times):
library(microbenchmark)
res <- microbenchmark(minimal_single_thread(1000),
brut_force_multi_thread(1000,20),
times=100L)
print(res)
res <- microbenchmark(minimal_single_thread(50000),
brut_force_multi_thread(50000,20),
times=5L)
print(res)
So, based on the advice to look into the ROCR package, I have found a sufficiently fast solution. I did this by passing the predictions and observations into prediction(), from which I got the confusion-matrix counts (TP, FP, FN, TN) for each threshold choice. From there I calculated all the performance metrics in a data.table. The results are a vast improvement on the previous best. Benchmark results on the small and large datasets (1,000 rows run 100 times and 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 334.515352 340.666631 353.93399 353.564355 362.62567 413.33399 100
ROCR_approach(1000) 9.377623 9.662029 10.38566 9.924076 10.37494 27.81753 100
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20375.35368 20470.45671 20594.56010 20534.32357 20696.55079 20896.11574 5
ROCR_approach(50000) 53.12959 53.60932 62.02762 53.74342 66.47123 83.18456 5
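For context, here is a minimal look at what ROCR's prediction() exposes and what the function below pulls out of it (slot names are ROCR's; the cutoffs include an extra leading Inf entry, which is why one element is dropped when aligning with the data.table):
library(ROCR)
#small illustrative example, not part of the benchmark
set.seed(10001)
p <- runif(10)
o <- sample(0:1, 10, replace = TRUE)
pr <- prediction(p, o)
unlist(pr@cutoffs) #thresholds in decreasing order, starting with Inf
unlist(pr@tp)      #true positives at each cutoff; fp, tn and fn are analogous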
ROCR function (this requires the ROCR package for prediction()):
library(ROCR)
ROCR_approach <-function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score - performance metrics will be held here
setorder(optimum_threshold,-pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
pred <- prediction(optimum_threshold$pred, optimum_threshold$obs)
optimum_threshold[,TP:=unlist(..pred@tp)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,FP:=unlist(..pred@fp)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,TN:=unlist(..pred@tn)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,FN:=unlist(..pred@fn)[-length(unlist(..pred@tp))]]#[-1]]
rm(pred)
optimum_threshold[,TPR:=TP/(TP+FN)]
optimum_threshold[,f_measure_unadj:=(2/((1/(TP/(TP+FP)))+(1/TPR)))]
optimum_threshold[,mcc:= (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))]
optimum_threshold[,f_measure:=(2/((1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FP - ..act_pos*TPR))))+
(1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FN + ..act_pos*(1-TPR)))))))]
setorder(optimum_threshold,pred)
#set all to null
optimum_threshold[,obs:=NULL]
optimum_threshold[,TP:=NULL]
optimum_threshold[,FP:=NULL]
optimum_threshold[,TN:=NULL]
optimum_threshold[,FN:=NULL]
#set any na's to 0
for(col_i in seq_len(ncol(optimum_threshold)))
set(optimum_threshold,which(is.na(optimum_threshold[[col_i]])),col_i,0L)
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
I am working on spatiotemporal observations of temperature, stored in arrays of size 100*100*504 (a 100*100 grid, for 504 different hours representing 21 days). I compute various indicators from those observations, for different periods (3 to 21 days), which obviously takes some time, and I am looking to improve computation efficiency. I am not very familiar with R, so I am not sure whether what I am doing is the most efficient way.
One of the things I want to do is to find (for each cell) the longest continuous period of time where the temperature is above a certain threshold. This is what I'm doing at the moment:
First I compute a boolean array based on the threshold, using the following function.
utci_test = array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
utci_hs = utci*0
utci_hs[which(utci > hs_threshold)] = 1
utci_hs[is.na(utci)] = 0
return(utci_hs)
}
Then I transform the vector of hourly values for each cell into an rle object, and return the maximum length of the runs of 1's (representing a continuous period over the threshold).
max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
r = rle(x)
max(r$lengths[as.logical(r$values)], fill = 0)
})
}
Looking at the time required, I noticed the second step is taking some time (bear in mind that I have to repeat this operation ~8,000 times in total):
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.051   0.004   0.055
system.time(to_hs(utci_test, hs_threshold=32.0))
#    user  system elapsed
#   0.053   0.000   0.052
utci_test_sh = to_hs(utci_test, hs_threshold=32.0)
system.time(max_duration_hs(utci_test_sh))
#    user  system elapsed
#   0.456   0.012   0.468
So I'm wondering if there is a more efficient way to do this, as I guess converting to an rle object might be inefficient?
You can get a bit of a speed bump by writing your own version of the rle() function: because you know you only want runs of 1's, it can do a little less comparison work. This gets you about 2x faster, down to a median time of about 250 milliseconds on my machine (a generic MacBook).
If you have to do this 8,000 times you'll save yourself the most time by parallelizing the code to run on a multicore machine, which is straightforward to do in R (check out e.g. the parallel package).
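For example, a minimal sketch of that (assuming the ~8,000 boolean arrays are held in a list called utci_list, a name used here only for illustration; on Windows use parLapply() with a cluster instead of mclapply()):
library(parallel)
n_cores <- max(1L, detectCores() - 1L)  #leave one core free
#apply the threshold + faster run-length functions to each array on separate cores
results <- mclapply(utci_list,
                    function(a) max_dur_hs_2(to_hs(a, hs_threshold = 32)),
                    mc.cores = n_cores)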
Below is the code for the rle speedup itself.
# generate data
set.seed(123)
utci_test <- array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
# original functions
to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
utci_hs = utci*0
utci_hs[which(utci > hs_threshold)] = 1
utci_hs[is.na(utci)] = 0
return(utci_hs)
}
max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
r = rle(x)
max(r$lengths[as.logical(r$values)], fill = 0)
})
}
# helper function replacing rle(): the zero positions (padded at both ends)
# mark the boundaries between runs of 1's, so the largest gap between
# consecutive boundaries, minus 1, is the longest run of 1's
rle_max <- function(v) {
  max(diff(c(0L, which(v==0), length(v)+1))) - 1
}
max_dur_hs_2 <- function(utci_hs) {
apply(utci_hs, MARGIN=c(1,2), FUN= rle_max)
}
# Check equivalence
utci_hs <- to_hs(utci = utci_test, hs_threshold = 32)
all.equal(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> [1] TRUE
# Test speed
library(microbenchmark)
microbenchmark(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> max_dur_hs_2(utci_hs) 216.1481 236.7825 250.9277 247.9918 262.4369 296.0146
#> max_duration_hs(utci_hs) 454.5740 476.5710 501.5119 489.9536 509.8750 774.9963
#> neval cld
#> 100 a
#> 100 b
Created on 2020-05-07 by the reprex package (v0.3.0)
Everything is in the question! Out of curiosity, while doing a bit of optimization and nailing down the bottlenecks, I tried this:
t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
times = 10000)
and the result is that mean() is 6+ times slower than the computation "by hand"!
Does it stem from the overhead in the code of mean() before the call to .Internal(mean), or is it the C code itself which is slower? Why? Is there a good reason, and thus a good use case?
It is due to the S3 lookup for the method, and then the necessary parsing of arguments in mean.default (and also the other code in mean).
sum and length are both primitive functions, so they will be fast (but how are you handling NA values?)
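As a quick illustration of the NA point (reproducing na.rm = TRUE by hand takes extra work):
t2 <- c(1, 2, NA)
mean(t2, na.rm = TRUE)                  # 1.5
sum(t2, na.rm = TRUE) / length(t2)      # 1 -- the denominator still counts the NA
sum(t2, na.rm = TRUE) / sum(!is.na(t2)) # 1.5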
t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 10000)
Unit: nanoseconds
expr min lq median uq max neval
mean(t1) 10266 10951 11293 11635 1470714 10000
sum(t1)/length(t1) 684 1027 1369 1711 104367 10000
mean.default(t1) 2053 2396 2738 2739 1167195 10000
.Internal(mean(t1)) 342 343 685 685 86574 10000
The internal bit of mean is faster even than sum/length.
See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).
Note that if we increase the length of the vector, the primitive approach is still fastest:
t1 <- rnorm(1e7)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 100)
Unit: milliseconds
expr min lq median uq max neval
mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137 100
sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824 100
mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896 100
.Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054 100
Now method dispatch is only a fraction of the overall "time" required.
mean is slower than computing "by hand" for several reasons:
S3 Method dispatch
NA handling
Error correction
Points 1 and 2 have already been covered. Point 3 is discussed in What algorithm is R using to calculate mean?. Basically, mean makes 2 passes over the vector in order to correct for floating point errors. sum only makes 1 pass over the vector.
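For illustration, a rough R sketch of the two-pass idea (this is not R's actual C code, just the algorithm it implements):
two_pass_mean <- function(x) {
  n <- length(x)
  m <- sum(x) / n     # first pass: naive mean
  m + sum(x - m) / n  # second pass: add back the mean of the residuals
}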
Notice that identical(sum(t1)/length(t1), mean(t1)) may be FALSE, due to these precision issues.
> set.seed(21); t1 <- rnorm(1e7,,21)
> identical(sum(t1)/length(t1), mean(t1))
[1] FALSE
> sum(t1)/length(t1) - mean(t1)
[1] 2.539201e-16
I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!
subsample=function(x,A) {
subsample.list=sapply(1:A,function(m) {
idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs
You can vectorise this way more (even using plyr), and go much much faster:
yoursummary = function(x) c(quantile(x, 0.025), mean(x), upper = quantile(x, 0.975))
subsampleX = function(x, M)
  yoursummary(
    aaply(
      daply(x, .(study), function(d) sample(d$value, M, replace = TRUE), .drop_o = FALSE),
      2, mean  # column means: the mean across studies for each of the M draws
    )
  )
The trick here is to do all the sampling up front: if we want to sample M times, why not draw all M samples for each study while we have access to it (as illustrated below).
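To make the shape concrete, a small illustrative run (re-creating the example data; with 4 studies and M = 5 the resampled matrix is 4 x 5, so each column is one complete draw):
library(plyr)
set.seed(1)
df <- data.frame(study=c(1,1,2,2,2,3,4,4), expt=c(1,2,1,2,3,1,1,2), value=runif(8))
samp <- daply(df, .(study), function(d) sample(d$value, 5, replace = TRUE))
dim(samp)      # 4 (studies) x 5 (draws)
colMeans(samp) # mean across studies for each draw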
Original code:
> system.time(subsample(df,20000))
user system elapsed
123.23 0.06 124.74
New vectorised code:
> system.time(subsampleX(df,20000))
user system elapsed
0.24 0.00 0.25
That's about 500x faster.
Here's a base R solution which avoids ddply for speed reasons:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
sample.experiments <- function(df) {
# note: this relies on rows being grouped by study (consecutive), as in the example df
r <- rle(df$study)
samp <- sapply( r$lengths , function(x) sample(seq(x),1) )
start.idx <- c(0,cumsum(r$lengths)[1:(length(r$lengths)-1)] )
df[samp + start.idx,]
}
> sample.experiments(df)
study expt value
1 1 1 0.6113196
4 2 2 0.5026527
6 3 1 0.2803080
7 4 1 0.9824377
Benchmarks
> m <- microbenchmark(
+ ddply(df,.(study),function(i) i[sample(1:nrow(i),1),]) ,
+ sample.experiments(df)
+ )
> m
Unit: microseconds
expr min lq median uq max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2 sample.experiments(df) 337.327 350.734 357.644 365.915 580.097
I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from the classInt package I wanted to use style = "jenks" (natural breaks), but this takes an inordinate amount of time even for a much smaller vector of only 500 values. Setting style = "kmeans" executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
system.time(classIntervals(x, n = 5, style = "jenks"))
R> system.time(classIntervals(x, n = 5, style = "jenks"))
user system elapsed
13.46 0.00 13.45
system.time(classIntervals(x, n = 5, style = "kmeans"))
R> system.time(classIntervals(x, n = 5, style = "kmeans"))
user system elapsed
0.02 0.00 0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be I will move the last two parts of the question to stats.stackexchange.com:
Under what circumstances is kmeans a reasonable substitute for Jenks?
Is it reasonable to define classes by running classInt on a random 1% subset of the data points (sketched below)?
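For concreteness, a minimal sketch of the subsetting idea, assuming x is the full-length vector (the toy x above is far too small for a 1% sample to be meaningful):
set.seed(1)
idx  <- sample(length(x), size = ceiling(0.01 * length(x)))
brks <- classIntervals(x[idx], n = 5, style = "jenks")$brks
brks[c(1, length(brks))] <- range(x) # widen the outer breaks so every value falls in a class
classes <- cut(x, breaks = brks, include.lowest = TRUE)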
To answer your original question:
What makes the Jenks algorithm so slow, and is there a faster way to
run it?
Indeed, there is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.
However, be aware that you have to set the number of breaks differently: if you set the number of classes to 5 in the classIntervals function of the classInt package, you have to set the number of breaks to 6 in the getJenksBreaks function of the BAMMtools package to get the same results (5 classes are bounded by 6 break points).
# Install and load library
install.packages("BAMMtools")
library(BAMMtools)
# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
# Apply function
getJenksBreaks(x, 6)
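To check the correspondence directly (classIntervals stores its breaks in the $brks element; per the note above the two should agree):
library(classInt)
classIntervals(x, n = 5, style = "jenks")$brks # 6 break values for 5 classes
getJenksBreaks(x, 6)                           # should return the same 6 breaks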
The speed-up is huge:
> microbenchmark( getJenksBreaks(x, 6, subset = NULL), classIntervals(x, n = 5, style = "jenks"), unit="s", times=10)
Unit: seconds
expr min lq mean median uq max neval cld
getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771 10 a
classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846 10
From ?BAMMtools::getJenksBreaks
The Jenks natural breaks method was ported to C from code found in the classInt R package.
The two implement the same algorithm; one is faster than the other simply because it is written in C rather than R.