I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!
subsample=function(x,A) {
subsample.list=sapply(1:A,function(m) {
idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs
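You can vectorise this much further (even while still using plyr) and make it much, much faster:
yoursummary=function(x) c(quantile(x,0.025),mean(x),upper=quantile(x,0.975))
subsampleX=function(x,M)
yoursummary(
aaply(
daply(.drop_o=F,x,.(study),
function(x) x$value[sample.int(nrow(x),M,replace=T)] #indexing avoids sample()'s special case for a single number
),1,mean #margin 1 averages within each study; margin 2 would give the per-iteration means the question computes
)
)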
The trick here is to do all the sampling up front. If we want to sample M times, we can draw all M values for a study at once, while we already have that study's rows in hand, instead of re-splitting the data on every iteration.
Original code:
> system.time(subsample(df,20000))
user system elapsed
123.23 0.06 124.74
New vectorised code:
> system.time(subsampleX(df,20000))
user system elapsed
0.24 0.00 0.25
That's about 500x faster.
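For comparison, here is a minimal base R sketch of the same up-front sampling idea, assuming the goal is the per-iteration mean across studies that the question describes. The name subsample_base is made up for illustration; it is not part of plyr or of either post, and it has not been benchmarked against the timings above.

```r
# Hypothetical helper (not from the original posts): draw all M samples per
# study up front, then average across studies within each draw.
subsample_base <- function(x, M) {
  groups <- split(x$value, x$study)              # one vector of values per study
  draws <- do.call(cbind, lapply(groups, function(v) {
    v[sample.int(length(v), M, replace = TRUE)]  # M draws from this study
  }))                                            # M rows (draws) x one column per study
  iter_means <- rowMeans(draws)                  # mean across studies for each draw
  c(lower = unname(quantile(iter_means, 0.025)),
    mean  = mean(iter_means),
    upper = unname(quantile(iter_means, 0.975)))
}

set.seed(1)
df <- data.frame(study = c(1, 1, 2, 2, 2, 3, 4, 4),
                 expt  = c(1, 2, 1, 2, 3, 1, 1, 2),
                 value = runif(8))
res <- subsample_base(df, 10000)
print(res)
```

Each row of draws plays the role of one iteration in the original loop: it holds exactly one sampled value per study.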
Here's a base R solution which avoids ddply for speed reasons:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
sample.experiments <- function(df) {
r <- rle(df$study)
samp <- sapply( r$lengths , function(x) sample.int(x,1) )
start.idx <- c(0,head(cumsum(r$lengths),-1)) # row offset of each study; assumes df is sorted by study
df[samp + start.idx,]
}
> sample.experiments(df)
  study expt     value
1     1    1 0.6113196
4     2    2 0.5026527
6     3    1 0.2803080
7     4    1 0.9824377
Benchmarks
> m <- microbenchmark(
+ ddply(df,.(study),function(i) i[sample(1:nrow(i),1),]) ,
+ sample.experiments(df)
+ )
> m
Unit: microseconds
expr min lq median uq max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2 sample.experiments(df) 337.327 350.734 357.644 365.915 580.097
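To tie this back to the question, the single-draw function can be repeated with replicate(). This is only a sketch: the function is restated (with sample.int and head) so the block runs on its own, the 1000 repetitions and the seed are arbitrary choices, and the data must be sorted by study for the rle() offsets to be valid.

```r
# Restated from the answer above so that this block is self-contained.
sample.experiments <- function(df) {
  r <- rle(df$study)                                  # run lengths = experiments per study
  samp <- sapply(r$lengths, function(x) sample.int(x, 1))
  start.idx <- c(0, head(cumsum(r$lengths), -1))      # row offset of each study
  df[samp + start.idx, ]
}

set.seed(42)
df <- data.frame(study = c(1, 1, 2, 2, 2, 3, 4, 4),
                 expt  = c(1, 2, 1, 2, 3, 1, 1, 2),
                 value = runif(8))

# One mean per repetition, then the summary statistics from the question.
iter_means <- replicate(1000, mean(sample.experiments(df)$value))
stats <- c(lower = unname(quantile(iter_means, 0.025)),
           mean  = mean(iter_means),
           upper = unname(quantile(iter_means, 0.975)))
print(stats)
```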
I want to efficiently solve a degree-7 polynomial in k.
For example, with the following set of 7 unconditional probabilities,
p <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
the overall event probability is approximately 25% :
> 1 - prod(1 - p)
[1] 0.2506676
And if I want to find a constant k that proportionally scales all elements of p so that the overall event probability becomes approximately 30%, I can do this using an equation solver (such as Wolfram Alpha), which may use Newton's method or bisection to approximate k in:
$$1 - \prod_{i=1}^{7} (1 - k\,p_i) = 0.30$$
Here, k is approximately 1.23:
> 1 - prod(1 - 1.23*p)
[1] 0.3000173
But what if I want to solve this for many different overall event probabilities, how can I efficiently do this in R?
I've looked at the function SMfzero in the package NLRoot, but it's still not clear to me how I can achieve it.
EDIT
I've benchmarked the solutions so far. On the toy data p above:
Unit: nanoseconds
expr min lq mean median uq max neval
approximation_fun 800 1700 3306.7 3100 4400 39500 1000
polynom_fun 1583800 1748600 2067028.6 1846300 2036300 16332600 1000
polyroot_fun 596800 658300 863454.2 716250 792100 44709000 1000
bsoln_fun 48800 59800 87029.6 85100 102350 613300 1000
find_k_fun 48500 60700 86657.4 85250 103050 262600 1000
NB: I'm not sure if it's fair to compare approximation_fun with the others, but I did ask for an approximate solution, so it does meet the brief.
The real problem is a degree-52 polynomial in k. Benchmarking on the real data:
Unit: microseconds
expr min lq mean median uq max neval
approximation_fun 1.9 3.20 7.8745 5.50 14.50 55.5 1000
polynom_fun 10177.2 10965.20 12542.4195 11268.45 12149.95 80230.9 1000
bsoln_fun 52.3 60.95 91.4209 71.80 117.75 295.6 1000
find_k_fun 55.0 62.80 90.1710 73.10 118.40 358.2 1000
This can be solved with the polynom library.
library(polynom)
library(purrr)
p <- runif(3, 0, 1)
p
#> [1] 0.1072518 0.5781922 0.3877427
# Overall probability
1 - prod(1 - p)
#> [1] 0.7694434
# Target overall probability
target_op <- 0.3
# calculate polynomial to solve for k
poly_list <- p %>%
map(~polynomial(c(1, -.))) %>%
as.polylist()
# List of linear polynomials to be multiplied:
poly_list
#> [[1]]
#> 1 - 0.1072518*x
#>
#> [[2]]
#> 1 - 0.5781922*x
#>
#> [[3]]
#> 1 - 0.3877427*x
# we want to solve this polynomial
poly <- 1 - prod(poly_list) - target_op
poly
#> -0.3 + 1.073187*x - 0.3277881*x^2 + 0.02404476*x^3
roots <- solve(poly)
good_roots <-
roots %>%
# keep only real values
keep(~Im(.) == 0) %>%
Re() %>%
# only positive
keep(~.>0)
good_roots
#> [1] 0.1448852
k <- good_roots[[1]]
1 - prod(1 - k*p)
#> [1] 0.3
Created on 2021-04-28 by the reprex package (v1.0.0)
Following @IaroslavDomin's solution, but constructing the coefficients for this particular case by hand, then using polyroot():
Here's a sequence of three functions (compute individual coeffs, put them together into a vector, find positive real roots):
## construct the ith coefficient: the sum of the products
## of all i-element combinations of p (an elementary symmetric polynomial)
bcoef <- function(p,i) {
sum(apply(combn(p,i),2,prod))
}
## compute all of these coefficients and put them together
## into the vector of coeffs for 1-prod(1-k*p)-target
mypoly <- function(p,target=0.3) {
c(-target,-1*sapply(seq_along(p), bcoef, p =-p))
}
## compute real positive solutions
soln <- function(p, target=0.3) {
roots <- polyroot(mypoly(p, target))
roots <- Re(roots[abs(Im(roots))<1e-16])
roots <- roots[roots>0]
if (length(roots)>1) warning(">1 solution")
return(roots)
}
Try it out for a couple of cases:
p1 <- c(0.1072518,0.5781922, 0.3877427)
s1 <- soln(p1)
1-prod(1-s1*p1)
p2 <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
s2 <- soln(p2)
1-prod(1-s2*p2)
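One caveat with the combn() route is that it enumerates every subset of p (2^n - 1 of them in total), which gets out of hand well before n = 52. A possible alternative, sketched here under the assumption that only the coefficient vector is needed, is to multiply in one factor (1 - p_i*k) at a time. The helper name poly_coefs and the 1e-8 tolerance are illustrative choices, not part of the answer above.

```r
# Hypothetical helper (not in the original answer): coefficients of
# 1 - prod(1 - k*p) - target in increasing powers of k, built by multiplying
# in one factor (1 - p_i*k) at a time.
poly_coefs <- function(p, target = 0.3) {
  coefs <- 1                                   # the constant polynomial 1
  for (prob in p) {
    coefs <- c(coefs, 0) - prob * c(0, coefs)  # multiply by (1 - prob*k)
  }
  coefs <- -coefs                              # now holds -prod(1 - k*p)
  coefs[1] <- coefs[1] + 1 - target            # add the constant 1 - target
  coefs
}

p1 <- c(0.1072518, 0.5781922, 0.3877427)
coefs <- poly_coefs(p1)
print(coefs)

roots <- polyroot(coefs)
real_roots <- Re(roots[abs(Im(roots)) < 1e-8])  # tolerance is a judgment call
k <- min(real_roots[real_roots > 0])            # smallest positive real root
print(1 - prod(1 - k * p1))
```

The loop does on the order of n^2 multiplications, so building the coefficients stays cheap for long p. Whether polyroot() remains numerically well behaved at high degree is a separate question that this sketch does not address.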
If you don't want to be clever, then brute force is perfectly adequate (56 microseconds on my machine when length(p) is 52):
bsoln <- function(p, target=0.3) {
f <- function(k) { (1-prod(1-k*p)) - target }
return(uniroot(f, c(0,20))$root)
}
asoln <- function(p, target=0.3) {
return(- log(1 - target) / sum(p))
}
I started to run benchmarks and gave up; I don't like the format of microbenchmark output and the approximate solution is too fast for rbenchmark::benchmark() to time accurately. In any case, one run of bsoln() with length(p)==52 takes on the order of 50 microseconds, so you're going to have to run this a whole bunch of times before speed becomes problematic ...
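The asoln() approximation is not explained above. One way to read it (an interpretation, not something the answer states) is that for small k*p_i, log(1 - k*p_i) is roughly -k*p_i, so prod(1 - k*p) is roughly exp(-k*sum(p)), and solving 1 - exp(-k*sum(p)) = target gives k = -log(1 - target)/sum(p). The sketch below compares that value with a tighter uniroot() solution on the toy data from the question. The bracket c(0, 1/max(p)) and the tolerance are assumptions made for this example.

```r
p <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439,
       0.03844668, 0.03487194, 0.03137491)
target <- 0.3

# Closed-form approximation: prod(1 - k*p) is treated as exp(-k * sum(p)).
k_approx <- -log(1 - target) / sum(p)

# Numerical solution of 1 - prod(1 - k*p) = target on [0, 1/max(p)].
f <- function(k) 1 - prod(1 - k * p) - target
k_exact <- uniroot(f, c(0, 1 / max(p)), tol = 1e-12)$root

print(c(approx = k_approx, exact = k_exact))
print(1 - prod(1 - k_approx * p))   # overall probability reached by the approximation
```

Because log(1 - x) is always below -x for 0 < x < 1, the approximation overshoots k slightly, and the overshoot grows as the individual k*p_i get larger.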
Another option would be to just search for a root on a segment without specifically calculating polynomial coefficients. This can be done e.g. with the uniroot function.
The only not-so-trivial thing we need to do here is to specify the segment. k is obviously >= 0, so that is the left endpoint. We also know that all the k*p values should be probabilities, hence <= 1. Therefore k <= 1/max(p), which is the right endpoint.
And so the code is:
find_k <- function(p, target_op) {
f <- function(x) 1 - prod(1 - x*p) - target_op
max_k <- 1/max(p)
res <- uniroot(f, c(0, max_k))
res$root
}
p <- runif(1000, 0, 1)
target_op <- 0.3
k <- find_k(p, target_op)
k
#> [1] 0.000710281
1 - prod(1 - k*p)
#> [1] 0.2985806
Created on 2021-04-29 by the reprex package (v1.0.0)
This works pretty fast even for 1000 probabilities.
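Since the question asks about many different overall event probabilities, the same root finder can simply be mapped over a vector of targets. This is a minimal sketch: the tol argument and the particular grid of targets are assumptions added for the example, and p is the toy data from the question.

```r
# find_k as above, with an optional tolerance (an addition for this sketch).
find_k <- function(p, target_op, tol = 1e-10) {
  f <- function(x) 1 - prod(1 - x * p) - target_op
  uniroot(f, c(0, 1 / max(p)), tol = tol)$root
}

p <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439,
       0.03844668, 0.03487194, 0.03137491)
targets <- seq(0.05, 0.50, by = 0.05)

# One root-finding call per target; vapply keeps the result a plain numeric vector.
ks <- vapply(targets, function(t) find_k(p, t), numeric(1))
achieved <- vapply(ks, function(k) 1 - prod(1 - k * p), numeric(1))
print(data.frame(target = targets, k = ks, achieved = achieved))
```

Passing a smaller tol than uniroot's default is what brings the achieved probability closer to the target than in the reprex output above.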
I am trying to find an efficient way to set a classification threshold for a predictive model's probability scores based on a custom performance metric in R. It is worth noting that the real data is imbalanced and has 35 million+ rows in the training set. This thus gives approximately 35 million predictive scores which could be set as the threshold split for the two classes. I have tried two approaches thus far
1. A 'smart', single thread approach trying to do minimal work
2. A brute-force, parallel multi-threaded approach.
Approach 1 performs a lot better (see below), but is still too slow on the real data (I gave up after it had been running for 25+ hours). My question is whether anyone has a better approach or knows a useful package for this. I have looked through Stack Overflow and can't find anything similar. I would think some parallel version of my first approach would be the best option, but since each iteration relies on the results of the previous one, I don't think this is easy to do.
Benchmark test results on small data (1000 rows, run 100 times & 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 338.5525 366.5356 387.0256 384.0934 396.6146 714.5271 100
brut_force_multi_thread(1000, 20) 6121.4523 6206.6340 6279.6554 6253.2492 6324.4614 6593.9065 100
Unit: seconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20.45089 21.31735 21.41669 21.56343 21.78985 21.96191 5
brut_force_multi_thread(50000, 20) 797.55525 797.60568 799.15903 797.73044 798.24058 804.66320 5
Code:
Firstly the two functionised approaches
#1. A 'smart', single thread approach trying to do minimal work
minimal_single_thread<-function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
comp <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score
setorder(comp,pred)
#create table to hold performance metrics
optimum_threshold <- data.table("pred"=comp$pred)
#Get the number of predictions at each unique prediction score
#necessary as two cases could have the same score
optimum_threshold <- optimum_threshold[, .(count = .N), by = pred]
setorder(optimum_threshold,pred)
#Add necessary columns
optimum_threshold[,f_measure:=0.0]
optimum_threshold[,TPR:=0.0]
optimum_threshold[,f_measure_unadj:=0.0]
optimum_threshold[,mcc:=0.0]
#Get totals for correcting the values for adjusted f-measure metric
num_negatives <- nrow(comp[obs==0,])
num_positives <- nrow(comp[obs==1,])
# Loop through all possible values of the cut-off(threshold) and store the confusion matrix scores
obs<-comp$obs
#need to compute logical every time for fp as you pred all 1 at first and then change to 0
comparison_fp_pred <- rep(1,length(obs))
comparison_fp <- (comparison_fp_pred & !obs)
#do need to for fn
comparison_fn_pred <- !rep(1,length(obs))
comparison_fn <- (comparison_fn_pred & obs)
act_pos<-sum(obs)
act_neg<-num_negatives
#keep count of last position for updating comparison
lst<-0L
row_ind <- 1L
for(pred_score_i in optimum_threshold$pred){
#find out how many cases at the predicted score
changed <- optimum_threshold[row_ind,count]
#Update the cases that have changed to the opposite to what they were before
#i.e. the prediction was 1 before and is now 0, so if pred was false before it is now true and vice versa; all the rest stays the same
comparison_fp_pred[(lst+1):(lst+changed)] <- !comparison_fp_pred[(lst+1):(lst+changed)]
comparison_fp[(lst+1):(lst+changed)] <- (comparison_fp_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
#need to calc logic for fn
comparison_fn_pred[(lst+1):(lst+changed)] <- !comparison_fn_pred[(lst+1):(lst+changed)]
comparison_fn[(lst+1):(lst+changed)] <- (comparison_fn_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
FP <- as.double(sum(comparison_fp))
FN <- as.double(sum(comparison_fn))
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + num_positives*TPR
TN_cor <- TN - num_positives*(1-TPR)
FP_cor <- FP - num_positives*TPR
FN_cor <- FN + num_positives*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
set(optimum_threshold,i=row_ind,j="TPR",value=TPR)
set(optimum_threshold,i=row_ind,j="f_measure_unadj",value=f1_unadj)
set(optimum_threshold,i=row_ind,j="mcc",value=MCC)
set(optimum_threshold,i=row_ind,j="f_measure",value=f1)
#update references
lst <- lst+changed
row_ind <- row_ind+1L
}
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
#2. A brute-force, parallel multi-threaded approach.
brut_force_multi_thread <-function(n,num_threads){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score - performance metrics will be held here
setorder(optimum_threshold,pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
print(paste("Number of threads used",num_threads))
cl <- makeCluster(num_threads)
registerDoParallel(cl)
cl_return <- foreach(row_ind = 1L:nrow(optimum_threshold),
.packages = c("data.table")) %dopar% {
FP <- nrow(optimum_threshold[(row_ind+1L):num_cases,][obs==0,])
FN <- sum(optimum_threshold[1L:row_ind,obs])
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + act_pos*TPR
TN_cor <- TN - act_pos*(1-TPR)
FP_cor <- FP - act_pos*TPR
FN_cor <- FN + act_pos*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
loop_dt <- data.table("pred"=optimum_threshold[row_ind,pred],"f_measure"=f1,
"TPR"=TPR,"f_measure_unadj"=f1_unadj,"mcc"=MCC)
return(loop_dt)
}
#stop cluster
stopCluster(cl)
#Combine all - Get unique values
optimum_threshold<-unique(rbindlist(cl_return))
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
Next the comparison to ensure the same results are obtained from the two approaches:
library(data.table)
library(parallel)
library(doParallel)
library(foreach)
minimal_single_thread_return <- minimal_single_thread(100)
brut_force_multi_thread_return <- brut_force_multi_thread(100,5)
print(brut_force_multi_thread_return)
$threshold
[1] 0.008086668
print(minimal_single_thread_return)
$threshold
[1] 0.008086668
Lastly benchmarking on a dataset of 1,000 rows, run 100 times and 50,000 rows 5 times:
library(microbenchmark)
res <- microbenchmark(minimal_single_thread(1000),
brut_force_multi_thread(1000,20),
times=100L)
print(res)
res <- microbenchmark(minimal_single_thread(50000),
brut_force_multi_thread(50000,20),
times=5L)
print(res)
So, based on the advice to look into the ROCR package, I have found a sufficiently fast solution. I passed the predictions and observations into prediction(), from which I got the confusion table values (TP, FP, FN, TN) for each threshold choice. From there I just calculated all performance metrics in a data.table. The results are a vast improvement on the previous best. Benchmark results on small and large datasets (1,000 rows run 100 times and 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 334.515352 340.666631 353.93399 353.564355 362.62567 413.33399 100
ROCR_approach(1000) 9.377623 9.662029 10.38566 9.924076 10.37494 27.81753 100
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20375.35368 20470.45671 20594.56010 20534.32357 20696.55079 20896.11574 5
ROCR_approach(50000) 53.12959 53.60932 62.02762 53.74342 66.47123 83.18456 5
ROCR function (prediction() comes from the ROCR package, so library(ROCR) must be loaded along with data.table):
ROCR_approach <-function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of decreasing prediction score - performance metrics will be held here
setorder(optimum_threshold,-pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
pred <- prediction(optimum_threshold$pred, optimum_threshold$obs)
optimum_threshold[,TP:=unlist(..pred@tp)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,FP:=unlist(..pred@fp)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,TN:=unlist(..pred@tn)[-length(unlist(..pred@tp))]]#[-1]]
optimum_threshold[,FN:=unlist(..pred@fn)[-length(unlist(..pred@tp))]]#[-1]]
rm(pred)
optimum_threshold[,TPR:=TP/(TP+FN)]
optimum_threshold[,f_measure_unadj:=(2/((1/(TP/(TP+FP)))+(1/TPR)))]
optimum_threshold[,mcc:= (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))]
optimum_threshold[,f_measure:=(2/((1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FP - ..act_pos*TPR))))+
(1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FN + ..act_pos*(1-TPR)))))))]
setorder(optimum_threshold,pred)
#set all to null
optimum_threshold[,obs:=NULL]
optimum_threshold[,TP:=NULL]
optimum_threshold[,FP:=NULL]
optimum_threshold[,TN:=NULL]
optimum_threshold[,FN:=NULL]
#set any na's to 0
for(col_i in seq_len(ncol(optimum_threshold)))
set(optimum_threshold,which(is.na(optimum_threshold[[col_i]])),col_i,0L)
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
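The reason this kind of approach scales is that, once the scores are sorted, the confusion counts for every threshold can be read off cumulative sums in a single pass. The base R sketch below illustrates that idea without ROCR. It is not the code above: it assumes there are no tied scores, treats a case as predicted positive when its score is at or above the threshold, and uses the plain F1 score instead of the adjusted f-measure.

```r
set.seed(10001)
n <- 1000
pred <- runif(n)                       # predicted scores
obs  <- sample(0:1, n, replace = TRUE) # observed classes

ord <- order(pred, decreasing = TRUE)  # highest scores first
obs_sorted <- obs[ord]

# Row i: the i highest-scoring cases are classified as positive.
TP <- cumsum(obs_sorted)
FP <- cumsum(1 - obs_sorted)
FN <- sum(obs) - TP
TN <- sum(1 - obs) - FP

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)
f1[is.nan(f1)] <- 0                    # 0/0 when there are no true positives yet

best <- which.max(f1)
threshold <- pred[ord][best]           # score of the last case still classified positive
print(c(threshold = threshold, f1 = f1[best]))
```

Any metric that is a function of TP, FP, TN and FN (such as the MCC or the adjusted f-measure above) can be computed from the same four vectors in one vectorised step.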
I am working on spatiotemporal observations of temperatures, stored in arrays of size 100*100*504 (a 100*100 grid, for 504 different hours representing 21 days). I am computing various indicators from those observations, for different periods (3 to 21 days), which obviously requires some time, and I'm looking to improve computational efficiency. I am not very familiar with R, so I am not sure whether what I am doing is the most efficient way.
One of the things I want to do is to find (for each cell) the longest continuous period of time where temperature is above a certain threshold. This is what I'm doing at the moment :
First I compute a boolean array based on the threshold using the following function.
utci_test = array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
utci_hs = utci*0
utci_hs[which(utci > hs_threshold)] = 1
utci_hs[is.na(utci)] = 0
return(utci_hs)
}
Then I transform each vector representing the hourly value for each cell into an rle object, and I return the maximum length of the 1's sequences (representing a continuous period over threshold).
max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
r = rle(x)
max(r$lengths[as.logical(r$values)], fill = 0)
})
}
Looking at the time required I noticed the second step is taking some time (bear in mind that I have to repeat this operation ~8000 times in total)
system.time(to_hs(utci_test, hs_threshold=32.0))
# utilisateur système écoulé
# 0.051 0.004 0.055
system.time(to_hs(utci_test, hs_threshold=32.0))
# utilisateur système écoulé
# 0.053 0.000 0.052
utci_test_sh = to_hs(utci_test, hs_threshold=32.0)
system.time(max_duration_hs(utci_test_sh))
# utilisateur système écoulé
# 0.456 0.012 0.468
So, I'm wondering if there is a more efficient way to do this, as I guess transforming into an rle object might be inefficient?
You can get a bit of a speed bump by writing your own version of the rle() function that takes advantage of the fact that you only want runs of 1's, and so does a little less comparison. This gets you about 2x faster, down to a median time of about 250 milliseconds or so on my machine (a generic MacBook).
If you have to do this 8,000 times you'll save yourself the most time by parallelizing the code to run on a multicore machine, which is straightforward to do in R (check out e.g. the parallel package).
The code for the speedup is below.
# generate data
set.seed(123)
utci_test <- array(runif(100*100*504, min = 18, max = 42), c(100,100,504))
# original functions
to_hs = function(utci, period=1:length(utci[1,1,]), hs_threshold){
utci_hs = utci*0
utci_hs[which(utci > hs_threshold)] = 1
utci_hs[is.na(utci)] = 0
return(utci_hs)
}
max_duration_hs = function(utci_hs, period=1:length(utci_hs[1,1,]) ){
apply(utci_hs, MARGIN=c(1,2), FUN=function(x){
r = rle(x)
max(r$lengths[as.logical(r$values)], fill = 0)
})
}
# helper func for rle
rle_max <- function(v) {
max(diff(c(0L, which(v==0), length(v)+1))) - 1
}
max_dur_hs_2 <- function(utci_hs) {
apply(utci_hs, MARGIN=c(1,2), FUN= rle_max)
}
# Check equivalence
utci_hs <- to_hs(utci = utci_test, hs_threshold = 32)
all.equal(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> [1] TRUE
# Test speed
library(microbenchmark)
microbenchmark(max_dur_hs_2(utci_hs),
max_duration_hs(utci_hs))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> max_dur_hs_2(utci_hs) 216.1481 236.7825 250.9277 247.9918 262.4369 296.0146
#> max_duration_hs(utci_hs) 454.5740 476.5710 501.5119 489.9536 509.8750 774.9963
#> neval cld
#> 100 a
#> 100 b
Created on 2020-05-07 by the reprex package (v0.3.0)
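To see why rle_max() works, it can help to trace it on a small vector. The function is restated here so the block runs on its own; the toy vector and the intermediate names zero_pos and gaps are made up for illustration. The trick assumes the input contains only 0s and 1s with no NA, which to_hs() guarantees.

```r
rle_max <- function(v) {
  max(diff(c(0L, which(v == 0), length(v) + 1L))) - 1L
}

v <- c(1, 1, 0, 1, 1, 1, 0, 0, 1)

# Positions of the zeros, with a sentinel "zero" before the start and after the end.
zero_pos <- c(0L, which(v == 0), length(v) + 1L)
# The distance between consecutive zeros, minus one, is the number of 1's between them.
gaps <- diff(zero_pos) - 1L

print(zero_pos)
print(gaps)
print(rle_max(v))   # the longest run of 1's is the largest gap
```

The sentinels are what make the edge cases work: a vector of all 1's has a single gap spanning the whole vector, and a vector of all 0's has only gaps of length zero.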
I have to calculate cosine similarity (a patient similarity metric) in R between the data of 48k patients, using some predictive variables. Here is the equation: PSM(P1, P2) = (P1 · P2) / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors corresponding to two different patients. For example, P1 is the index patient and P2 is the patient compared against it, giving the pairwise patient similarity metric PSM(P1, P2).
This process will go on for all 48k patients.
I have added a sample dataset for 300 patients in a .csv file. Please find the sample dataset here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, as the number of rows in your input increases, the cost should grow roughly quadratically, because every row is compared with every other row.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
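If you would rather not depend on lsa, the whole similarity matrix can also be computed with plain matrix algebra: divide each row by its length, then take the cross-product of the rows. The sketch below uses the built-in iris data as a stand-in for the patient table; cosine_matrix is a made-up helper name, and the code assumes no row is all zeros after scaling.

```r
# Hypothetical helper: cosine similarity between all pairs of rows of X.
cosine_matrix <- function(X) {
  X <- as.matrix(X)
  norms <- sqrt(rowSums(X^2))   # Euclidean length of each row
  tcrossprod(X / norms)         # row i is divided by norms[i], then all dot products
}

X <- scale(iris[, 1:4])         # stand-in for the scaled numeric patient columns
cs <- cosine_matrix(X)
dim(cs)

# Each unordered pair appears twice in cs; the upper triangle holds it once.
pairs_idx <- which(upper.tri(cs), arr.ind = TRUE)
pair_sims <- cs[upper.tri(cs)]
print(head(cbind(pairs_idx, similarity = pair_sims)))
```

Working from the upper triangle halves the number of pairs to carry around, which is worth having at 48k rows, although the full matrix still has to fit in memory in this form.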
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have the highest mean similarities (up to about 0.87), and comparisons between setosa and the other two varieties have negative mean similarities (down to about -0.79).
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
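The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats on Wikipedia:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$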
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculations of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you’re down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that runs from the origin (0, 0) to (1, 0) and individual.2 as a line that runs from the origin to (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
# label each point; xlim/ylim are set by plot() and are not arguments of text()
text(
some.data,
rownames(some.data),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
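As a quick sanity check on the same function, here is a small sketch using a few hand-picked vectors. The vectors `reference`, `parallel`, `orthogonal` and `opposite` are made up for illustration and are not part of the original data.

```r
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }

reference  <- c(1, 0)
parallel   <- c(3, 0)   # same direction, different length
orthogonal <- c(0, 2)   # at a right angle to the reference
opposite   <- c(-1, 0)  # pointing the other way

# as.numeric() drops the 1x1 matrix wrapper returned by %*%
sims <- c(
  parallel   = as.numeric(cosine.sim(reference, parallel)),
  orthogonal = as.numeric(cosine.sim(reference, orthogonal)),
  opposite   = as.numeric(cosine.sim(reference, opposite))
)
print(sims)
```

The similarity depends only on direction, not on length: it is 1 for vectors pointing the same way, 0 for perpendicular vectors, and -1 for vectors pointing in opposite directions.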
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
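For comparison, the same pairwise similarities can be computed without any explicit looping. This is only a sketch: it assumes psm_scaled was built as `scale(iris[, 1:4])`, which may differ from how it was actually defined earlier, and the names `dot.products`, `row.norms` and `cs.matrix` are introduced here for illustration.

```r
# Assumption: psm_scaled is the column-scaled numeric part of iris.
psm_scaled <- scale(iris[, 1:4])

cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }

# Dot product between every pair of rows (a 150 x 150 matrix).
dot.products <- tcrossprod(psm_scaled)
# Euclidean length of every row.
row.norms <- sqrt(rowSums(psm_scaled^2))
# Divide each dot product by the product of the two row lengths.
cs.matrix <- dot.products / outer(row.norms, row.norms)

# Spot check one pair against the explicit pairwise function.
spot.check <- as.numeric(cosine.sim(psm_scaled[1, ], psm_scaled[51, ]))
print(all.equal(cs.matrix[1, 51], spot.check))
```

The result is a square matrix rather than a long table, so it would still need to be melted before aggregating by class.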
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe, is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
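The first point in that list, spotting highly correlated variables, can be sketched in base R. The 0.9 cutoff is an arbitrary choice for illustration, and `cor.mat`, `high` and `high.pairs` are names made up here.

```r
# Pairwise Pearson correlations between the four numeric iris columns.
cor.mat <- cor(iris[, 1:4])

# Keep each pair once (upper triangle) and flag strong correlations.
high <- which(abs(cor.mat) > 0.9 & upper.tri(cor.mat), arr.ind = TRUE)

high.pairs <- data.frame(
  var.A = rownames(cor.mat)[high[, "row"]],
  var.B = colnames(cor.mat)[high[, "col"]],
  r     = cor.mat[high]
)
print(high.pairs)
```

Whether to drop one variable from a flagged pair is a modelling decision; the snippet only identifies candidates.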
I've been racking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer, but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around stackoverflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error, but since the code is returning the wrong answer I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set calculating the t test for each row, but it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in 1:nrow(DataSample)){
# var.equal=TRUE requests the pooled-variance test the question asks for
pvalue[i] <- t.test(DataSample[i,1:12],DataSample[i,13:24],var.equal=TRUE)$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
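For context on why the attempt in the question gives the wrong count: inside its apply() call, the second argument to t.test is the whole control block `DataSample[,13:24]`, not the control columns of the current row. The sketch below checks this on the first row by looking at the degrees of freedom; the variable names are made up for illustration.

```r
set.seed(2112)
DataSample <- matrix(rnorm(24000), nrow = 1000)

first.row.trial <- DataSample[1, 1:12]

# What the question's code passes as the second sample: all 12000 control values.
whole.control.block <- DataSample[, 13:24]
wrong.test <- t.test(first.row.trial, whole.control.block, var.equal = TRUE)

# What a row-wise test should use: the 12 control values of the same row.
same.row.control <- DataSample[1, 13:24]
right.test <- t.test(first.row.trial, same.row.control, var.equal = TRUE)

# Degrees of freedom are n.x + n.y - 2 for the pooled-variance test.
print(c(wrong = unname(wrong.test$parameter),
        right = unname(right.test$parameter)))
```

The first test has 12 + 12000 - 2 degrees of freedom, which shows that each row of trial values was being compared against every control value in the matrix.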
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
# one t-test per row, comparing that row's Trial columns with its Control columns
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
t.test(DataSample.dt[x, paste0('Trial', 1:12), with=FALSE],
DataSample.dt[x, paste0('Control', 13:24), with=FALSE],
var.equal=TRUE)$p.value) < 0.05)
To do a paired t test, you need to supply the paired = TRUE parameter. The t.test function isn't vectorised, but it's quite simple to do t tests on a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
# expr min lq mean median uq max
# manual 1.611808 1.641783 1.677010 1.663122 1.709401 1.852347
# apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675
# rowttests 2.368823 2.417837 2.639671 2.574320 2.757870 7.207135
# neval
# 100
# 100
# 100
You can see the manual method is over 200x faster than apply.
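If you want to convince yourself that the manual paired calculation matches t.test, here is a minimal base R sketch. It uses `apply(dd, 1, sd)` in place of `rowVars` to avoid extra packages; `row.sd`, `t.stat`, `p.manual` and `p.reference` are names introduced for illustration.

```r
set.seed(2112)
DataSample <- matrix(rnorm(24000), nrow = 1000)

# A paired test is a one-sample test on the within-row differences.
dd <- DataSample[, 1:12] - DataSample[, 13:24]
n <- ncol(dd)

row.sd <- apply(dd, 1, sd)
t.stat <- rowMeans(dd) / (row.sd / sqrt(n))
p.manual <- 2 * pt(-abs(t.stat), df = n - 1)

# Reference: one paired t.test() call per row.
p.reference <- apply(DataSample, 1, function(r) {
  t.test(r[1:12], r[13:24], paired = TRUE)$p.value
})

print(all.equal(p.manual, p.reference))
```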
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
apply = { ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same size.
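The unpaired manual formula can be checked the same way. This is a sketch in base R under the equal-group-size assumption; `row.var` is a hypothetical helper standing in for `matrixStats::rowVars`.

```r
set.seed(2112)
DataSample <- matrix(rnorm(24000), nrow = 1000)
x <- DataSample[, 1:12]
y <- DataSample[, 13:24]
n <- ncol(x)  # group size, assumed equal for both groups

# Row-wise sample variances without extra packages.
row.var <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

# With equal group sizes the pooled standard error simplifies to
# sqrt((var.x + var.y) / n), and the test has 2n - 2 degrees of freedom.
t.stat <- (rowMeans(x) - rowMeans(y)) / sqrt((row.var(x) + row.var(y)) / n)
p.manual <- 2 * pt(-abs(t.stat), df = 2 * n - 2)

# Reference: one pooled-variance t.test() call per row.
p.reference <- apply(DataSample, 1, function(r) {
  t.test(r[1:12], r[13:24], var.equal = TRUE)$p.value
})

print(all.equal(p.manual, p.reference))
```

With unequal group sizes the pooled variance has to weight each group by its own degrees of freedom, so the shortcut in the denominator no longer applies.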
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4