I want to efficiently solve a degree-7 polynomial in k.
For example, with the following set of 7 unconditional probabilities,
p <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
the overall event probability is approximately 25%:
> 1 - prod(1 - p)
[1] 0.2506676
And if I want to find a constant k that proportionally scales every element of p so that the overall event probability becomes approximately 30%, I can do this using an equation solver (such as Wolfram Alpha), which may use Newton's method or bisection to approximate k in 1 - prod(1 - k*p) = 0.3.
Here, k is approximately 1.23:
> 1 - prod(1 - 1.23*p)
[1] 0.3000173
But what if I want to solve this for many different overall event probabilities? How can I do it efficiently in R?
I've looked at the function SMfzero in the package NLRoot, but it's still not clear to me how to apply it here.
EDIT
I've benchmarked the solutions so far. On the toy data p above:
Unit: nanoseconds
expr min lq mean median uq max neval
approximation_fun 800 1700 3306.7 3100 4400 39500 1000
polynom_fun 1583800 1748600 2067028.6 1846300 2036300 16332600 1000
polyroot_fun 596800 658300 863454.2 716250 792100 44709000 1000
bsoln_fun 48800 59800 87029.6 85100 102350 613300 1000
find_k_fun 48500 60700 86657.4 85250 103050 262600 1000
NB, I'm not sure if it's fair to compare approximation_fun with the others, but I did ask for an approximate solution, so it does meet the brief.
The real problem is a degree-52 polynomial in k. Benchmarking on the real data:
Unit: microseconds
expr min lq mean median uq max neval
approximation_fun 1.9 3.20 7.8745 5.50 14.50 55.5 1000
polynom_fun 10177.2 10965.20 12542.4195 11268.45 12149.95 80230.9 1000
bsoln_fun 52.3 60.95 91.4209 71.80 117.75 295.6 1000
find_k_fun 55.0 62.80 90.1710 73.10 118.40 358.2 1000
This can be solved with the polynom library.
library(polynom)
library(purrr)
p <- runif(3, 0, 1)
p
#> [1] 0.1072518 0.5781922 0.3877427
# Overall probability
1 - prod(1 - p)
#> [1] 0.7694434
# Target overall probability
target_op <- 0.3
# calculate polynomial to solve for k
poly_list <- p %>%
map(~polynomial(c(1, -.))) %>%
as.polylist()
# List of linear polynomials to be multiplied:
poly_list
#> [[1]]
#> 1 - 0.1072518*x
#>
#> [[2]]
#> 1 - 0.5781922*x
#>
#> [[3]]
#> 1 - 0.3877427*x
# we want to solve this polynomial
poly <- 1 - prod(poly_list) - target_op
poly
#> -0.3 + 1.073187*x - 0.3277881*x^2 + 0.02404476*x^3
roots <- solve(poly)
good_roots <-
roots %>%
# keep only real values
keep(~Im(.) == 0) %>%
Re() %>%
# only positive
keep(~.>0)
good_roots
#> [1] 0.1448852
k <- good_roots[[1]]
1 - prod(1 - k*p)
#> [1] 0.3
Created on 2021-04-28 by the reprex package (v1.0.0)
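If this needs to be repeated for many different target probabilities (as in the benchmark above, where this approach appears as polynom_fun), the same steps can be wrapped into a function. A minimal sketch, using the name polynom_fun only because the benchmark does:
library(polynom)
library(purrr)
# Sketch: wrap the polynom-based root finding so it can be called per target
polynom_fun <- function(p, target_op = 0.3) {
  poly_list <- as.polylist(map(p, ~polynomial(c(1, -.))))
  poly <- 1 - prod(poly_list) - target_op
  roots <- solve(poly)
  # keep the real, positive root (same filtering as above)
  real_pos <- Re(roots[Im(roots) == 0 & Re(roots) > 0])
  real_pos[1]
}
# e.g. solve for several overall event probabilities at once
sapply(c(0.2, 0.3, 0.4), function(t) polynom_fun(p, t))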
Following @IaroslavDomin's solution, but constructing the coefficients for this particular case by hand, then using polyroot():
Here's a sequence of three functions (compute individual coeffs, put them together into a vector, find positive real roots):
## construct the ith elementary symmetric sum: the sum of the products
## of all i-element combinations
bcoef <- function(p,i) {
sum(apply(combn(p,i),2,prod))
}
## compute all the coefficients and put them together
## into the vector of coeffs for 1 - prod(1 - k*p) - target
mypoly <- function(p,target=0.3) {
c(-target,-1*sapply(seq_along(p), bcoef, p =-p))
}
## compute real positive solutions
soln <- function(p, target=0.3) {
  roots <- polyroot(mypoly(p, target))
  roots <- Re(roots[abs(Im(roots))<1e-16])
  roots <- roots[roots>0]
  if (length(roots)>1) warning(">1 solution")
  return(roots)
}
Try it out for a couple of cases:
p1 <- c(0.1072518,0.5781922, 0.3877427)
s1 <- soln(p1)
1-prod(1-s1*p1)
p2 <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
s2 <- soln(p2)
1-prod(1-s2*p2)
If you don't want to be clever, then brute force is perfectly adequate (56 microseconds on my machine when length(p) is 52):
bsoln <- function(p, target=0.3) {
f <- function(k) { (1-prod(1-k*p)) - target }
return(uniroot(f, c(0,20))$root)
}
A very rough closed-form approximation follows from prod(1 - k*p) ≈ exp(-k*sum(p)) when the k*p values are small, so solving 1 - exp(-k*sum(p)) = target gives k ≈ -log(1 - target)/sum(p):
asoln <- function(p, target=0.3) {
  ## closed-form approximation, accurate when the k*p values are small
  return(- log(1 - target) / sum(p))
}
I started to run benchmarks and gave up; I don't like the format of microbenchmark output and the approximate solution is too fast for rbenchmark::benchmark() to time accurately. In any case, one run of bsoln() with length(p)==52 takes on the order of 50 microseconds, so you're going to have to run this a whole bunch of times before speed becomes problematic ...
Another option would be to just search for a root on a segment without specifically calculating polynomial coefficients. This can be done e.g. with the uniroot function.
The only not-so-trivial thing we need to do here is to specify the search interval. k is obviously >= 0, so that gives the left endpoint. Then we know that all the k*p values should be probabilities, hence <= 1. Therefore k <= 1/max(p), which gives the right endpoint.
And so the code is:
find_k <- function(p, target_op) {
f <- function(x) 1 - prod(1 - x*p) - target_op
max_k <- 1/max(p)
res <- uniroot(f, c(0, max_k))
res$root
}
p <- runif(1000, 0, 1)
target_op <- 0.3
k <- find_k(p, target_op)
k
#> [1] 0.000710281
1 - prod(1 - k*p)
#> [1] 0.2985806
Created on 2021-04-29 by the reprex package (v1.0.0)
This works pretty fast even for 1000 probabilities.
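Since the original question asks about many different overall event probabilities, one way (my own sketch, not part of the answer above) is simply to apply find_k over a vector of targets:
# Solve for a whole set of target overall probabilities
targets <- c(0.1, 0.2, 0.3, 0.4)
ks <- sapply(targets, function(t) find_k(p, t))
# check: each k reproduces its target
sapply(ks, function(k) 1 - prod(1 - k * p))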
Related
I am trying to find an efficient way to set a classification threshold for a predictive model's probability scores based on a custom performance metric in R. It is worth noting that the real data is imbalanced and has 35 million+ rows in the training set. This thus gives approximately 35 million predictive scores which could be set as the threshold split for the two classes. I have tried two approaches thus far
1. A 'smart', single thread approach trying to do minimal work
2. A brute-force, parallel multi-threaded approach.
Approach 1 performs a lot better, see below, but is still too slow
on the real data (I gave up after it had been running for 25+ hours). My question is whether anyone has a better approach or knows a useful package for this. I have looked through Stack Overflow and can't find anything similar. I would think some parallel version of my first approach would be the best option, but since each iteration relies on the results of the previous one, I don't think it is easy to parallelise.
Benchmark test results on small data (1000 rows, run 100 times & 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 338.5525 366.5356 387.0256 384.0934 396.6146 714.5271 100
brut_force_multi_thread(1000, 20) 6121.4523 6206.6340 6279.6554 6253.2492 6324.4614 6593.9065 100
Unit: seconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20.45089 21.31735 21.41669 21.56343 21.78985 21.96191 5
brut_force_multi_thread(50000, 20) 797.55525 797.60568 799.15903 797.73044 798.24058 804.66320 5
Code:
First, the two functionised approaches:
#1. A 'smart', single thread approach trying to do minimal work
minimal_single_thread<-function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
comp <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score
setorder(comp,pred)
#create table to hold performance metrics
optimum_threshold <- data.table("pred"=comp$pred)
#Get the number of predictions at each unique prediction score
#necessary as two cases could have same score
optimum_threshold <- optimum_threshold[, .(count = .N), by = pred]
setorder(optimum_threshold,pred)
#Add necessary columns
optimum_threshold[,f_measure:=0.0]
optimum_threshold[,TPR:=0.0]
optimum_threshold[,f_measure_unadj:=0.0]
optimum_threshold[,mcc:=0.0]
#Get totals for correcting the values for adjusted f-measure metric
num_negatives <- nrow(comp[obs==0,])
num_positives <- nrow(comp[obs==1,])
# Loop through all possible values of the cut-off(threshold) and store the confusion matrix scores
obs<-comp$obs
#the fp logical needs updating each time: predictions all start as 1 and are then flipped to 0 as the threshold moves
comparison_fp_pred <- rep(1,length(obs))
comparison_fp <- (comparison_fp_pred & !obs)
#the same needs doing for fn
comparison_fn_pred <- !rep(1,length(obs))
comparison_fn <- (comparison_fn_pred & obs)
act_pos<-sum(obs)
act_neg<-num_negatives
#keep count of last position for updating comparison
lst<-0L
row_ind <- 1L
for(pred_score_i in optimum_threshold$pred){
#find out how many cases at the predicted score
changed <- optimum_threshold[row_ind,count]
#Update the cases that have changed to the opposite to what they were before
#i.e. the prediction was 1 before and is now 0, so if pred was false before it is now true and vice versa; all the rest stays the same
comparison_fp_pred[(lst+1):(lst+changed)] <- !comparison_fp_pred[(lst+1):(lst+changed)]
comparison_fp[(lst+1):(lst+changed)] <- (comparison_fp_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
#need to calc logic for fn
comparison_fn_pred[(lst+1):(lst+changed)] <- !comparison_fn_pred[(lst+1):(lst+changed)]
comparison_fn[(lst+1):(lst+changed)] <- (comparison_fn_pred[(lst+1):(lst+changed)]& obs[(lst+1):(lst+changed)])
FP <- as.double(sum(comparison_fp))
FN <- as.double(sum(comparison_fn))
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + num_positives*TPR
TN_cor <- TN - num_positives*(1-TPR)
FP_cor <- FP - num_positives*TPR
FN_cor <- FN + num_positives*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
set(optimum_threshold,i=row_ind,j="TPR",value=TPR)
set(optimum_threshold,i=row_ind,j="f_measure_unadj",value=f1_unadj)
set(optimum_threshold,i=row_ind,j="mcc",value=MCC)
set(optimum_threshold,i=row_ind,j="f_measure",value=f1)
#update references
lst <- lst+changed
row_ind <- row_ind+1L
}
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
#2. A brute-force, parallel multi-threaded approach.
brut_force_multi_thread <-function(n,num_threads){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score - performance metrics will be held here
setorder(optimum_threshold,pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
print(paste("Number of threads used",num_threads))
cl <- makeCluster(num_threads)
registerDoParallel(cl)
cl_return <- foreach(row_ind = 1L:nrow(optimum_threshold),
.packages = c("data.table")) %dopar% {
FP <- nrow(optimum_threshold[(row_ind+1L):num_cases,][obs==0,])
FN <- sum(optimum_threshold[1L:row_ind,obs])
TN <- act_neg - FP
TP <- act_pos - FN
if(is.na(TN)) TN <- 0
if(is.na(TP)) TP <- 0
if(is.na(FN)) FN <- 0
if(is.na(FP)) FP <- 0
TPR <- TP/(TP+FN)
Precision <- TP/(TP+FP)
f1_unadj<-(2/((1/Precision)+(1/TPR)))
#mcc
MCC <- (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(MCC)) MCC <- 0
TP_cor <- TP + act_pos*TPR
TN_cor <- TN - act_pos*(1-TPR)
FP_cor <- FP - act_pos*TPR
FN_cor <- FN + act_pos*(1-TPR)
TPR_cor <- TP_cor/(TP_cor+FN_cor)
Precision_cor <- TP_cor/(TP_cor+FP_cor)
f1<-(2/((1/Precision_cor)+(1/TPR_cor)))
#for cases where precision or recall is 0 need to put 0 as total value to avoid math error
if(is.na(f1)) f1 <- 0
loop_dt <- data.table("pred"=optimum_threshold[row_ind,pred],"f_measure"=f1,
"TPR"=TPR,"f_measure_unadj"=f1_unadj,"mcc"=MCC)
return(loop_dt)
}
#stop cluster
stopCluster(cl)
#Combine all - Get unique values
optimum_threshold<-unique(rbindlist(cl_return))
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
Next the comparison to ensure the same results are obtained from the two approaches:
library(data.table)
library(parallel)
library(doParallel)
library(foreach)
minimal_single_thread_return <- minimal_single_thread(100)
brut_force_multi_thread_return <- brut_force_multi_thread(100,5)
print(brut_force_multi_thread_return)
$threshold
[1] 0.008086668
print(minimal_single_thread_return)
$threshold
[1] 0.008086668
Lastly benchmarking on a dataset of 1,000 rows, run 100 times and 50,000 rows 5 times:
library(microbenchmark)
res <- microbenchmark(minimal_single_thread(1000),
brut_force_multi_thread(1000,20),
times=100L)
print(res)
res <- microbenchmark(minimal_single_thread(50000),
brut_force_multi_thread(50000,20),
times=5L)
print(res)
So based on the advice to look into the ROCR package, I have found a sufficiently fast solution. I did this by passing the predictions and observations into prediction(), from which I got the confusion-table values (TP, FP, FN, TN) for each threshold choice. From there I just calculated all performance metrics in a data.table. The results are a vast improvement on the previous best; benchmarking test results on small and large datasets (1000 rows, run 100 times & 50,000 rows run 5 times):
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(1000) 334.515352 340.666631 353.93399 353.564355 362.62567 413.33399 100
ROCR_approach(1000) 9.377623 9.662029 10.38566 9.924076 10.37494 27.81753 100
Unit: milliseconds
expr min lq mean median uq max neval
minimal_single_thread(50000) 20375.35368 20470.45671 20594.56010 20534.32357 20696.55079 20896.11574 5
ROCR_approach(50000) 53.12959 53.60932 62.02762 53.74342 66.47123 83.18456 5
ROCR function:
library(ROCR)
ROCR_approach <- function(n){
#create random predictions and observations i.e. the actuals
set.seed(10001)
optimum_threshold <- data.table("pred"=runif(n),
"obs"=sample(0:1,n,replace=T))
#put in order of increasing prediction score - performance metrics will be held here
setorder(optimum_threshold,-pred)
#Get totals for correcting the values for adjusted f-measure metric
act_neg <- nrow(optimum_threshold[obs==0,])
act_pos <- nrow(optimum_threshold[obs==1,])
num_cases <- as.integer(act_pos+act_neg)
pred <- prediction(optimum_threshold$pred, optimum_threshold$obs)
optimum_threshold[,TP:=unlist(..pred@tp)[-length(unlist(..pred@tp))]]
optimum_threshold[,FP:=unlist(..pred@fp)[-length(unlist(..pred@tp))]]
optimum_threshold[,TN:=unlist(..pred@tn)[-length(unlist(..pred@tp))]]
optimum_threshold[,FN:=unlist(..pred@fn)[-length(unlist(..pred@tp))]]
rm(pred)
optimum_threshold[,TPR:=TP/(TP+FN)]
optimum_threshold[,f_measure_unadj:=(2/((1/(TP/(TP+FP)))+(1/TPR)))]
optimum_threshold[,mcc:= (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))]
optimum_threshold[,f_measure:=(2/((1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FP - ..act_pos*TPR))))+
(1/((TP + ..act_pos*TPR)/((TP + ..act_pos*TPR)+(FN + ..act_pos*(1-TPR)))))))]
setorder(optimum_threshold,pred)
#set all to null
optimum_threshold[,obs:=NULL]
optimum_threshold[,TP:=NULL]
optimum_threshold[,FP:=NULL]
optimum_threshold[,TN:=NULL]
optimum_threshold[,FN:=NULL]
#set any na's to 0
for(col_i in seq_len(ncol(optimum_threshold)))
set(optimum_threshold,which(is.na(optimum_threshold[[col_i]])),col_i,0L)
# Threshold is the max adjusted f-measure
setorder(optimum_threshold,-f_measure)
threshold <- as.numeric(optimum_threshold[1,pred])
return(list("threshold"=threshold))
}
I can do it for the two-sample t test, but not for the median test, the Wilcoxon test, or the Hodges-Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143))/length(null_dist))
I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you, and the for loop is not the most efficient, but it's plenty good enough and may help you learn. Since we're looping anyway, I computed the mean and median at the same time, as all other parts of the code are identical. ifelse is a nice, easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
Following up on your question about the Hodges-Lehmann test, quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times; the answer is the same every time because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.
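In the meantime, here is a sketch of one common form of the median test (Mood's median test) in base R; this assumes that is the "median test" meant, which may not match your course's definition:
# Mood's median test: count values above the pooled median in each group,
# then test the resulting 2x2 table (Fisher's exact test suits small samples)
grand_med <- median(c(data_2000, data_2019))
above <- rbind(
  "2000" = table(factor(data_2000 > grand_med, levels = c(FALSE, TRUE))),
  "2019" = table(factor(data_2019 > grand_med, levels = c(FALSE, TRUE)))
)
fisher.test(above)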
I've been racking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer, but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error -- but since the code returns the wrong answer, I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set calculating the t test for each row, but it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in 1:nrow(DataSample)){
pvalue[i] <- t.test(DataSample[i,1:12],DataSample[i,13:24])$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
t.test(DataSample.dt[x, paste0('Trial', 1:12), with=F],
DataSample.dt[x, paste0('Control', 13:24), with=F],
var.equal=T)$p.value) < 0.05)
To do a paired t test, you need to supply the paired = TRUE parameter. The t.test function isn't vectorised, but it's quite simple to do t tests a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
apply = { ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same sizes.
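If the two groups had different sizes, the manual version would need the usual pooled-variance formula instead; a sketch (using matrixStats::rowVars as above, with a hypothetical helper name):
library(matrixStats)
# Manual equal-variance t test for row-wise groups of sizes n1 and n2
manual_unequal_n <- function(x, y) {
  n1 <- ncol(x); n2 <- ncol(y)
  sp2 <- ((n1 - 1) * rowVars(x) + (n2 - 1) * rowVars(y)) / (n1 + n2 - 2)  # pooled variance
  tstat <- (rowMeans(x) - rowMeans(y)) / sqrt(sp2 * (1/n1 + 1/n2))
  2 * pt(-abs(tstat), df = n1 + n2 - 2)  # two-sided p-values
}
ps_unequal <- manual_unequal_n(DataSample[, 1:12], DataSample[, 13:24])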
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4
So I'm using the Monte Carlo method to evaluate definite integrals of a bunch of functions.
To start with,
y = x ^ (-0.5) ; for x in [0.01,1]
for which, my code in R looks like this
#
s <- NULL
m<- 100
a<- 0.01
b<- 1
set.seed(5)
x<-runif(m,a,b)
y<-runif(m,0,1)
for (i in 1:m){
if(y[i]<(x[i]^(-0.5))){
s[i] <- 1
}
else{
s[i] <-0
}
}
nn<-sum(s==1)*(b-a)/m
print(nn)
#
Answer (nn): 0.99
Actual answer: 1.8
I cannot figure out where I'm going wrong with this. Have I done something wrong?
A positive number less than 1 raised to a negative power is greater than 1, so x^(-0.5) is essentially always greater than y (which is drawn from [0, 1]), and you shouldn't be surprised to get a vector of all 1s.
The rectangle you're using is too short (a height of 1). It should be 10 tall, since 0.01^(-0.5) = 10 is the maximum value of the function on the interval.
Then you take the total area of the rectangle and multiply it by the fraction of hits (the mean of s), so the revised code looks like this:
s <- NULL
m<- 100
a<- 0.01
b<- 1
set.seed(5)
x<-runif(m,a,b)
y<-10*runif(m,0,1)
for (i in 1:m){
if(y[i]<(x[i]^(-0.5))){
s[i] <- 1
}
else{
s[i] <-0
}
}
nn <- sum(s)*(b-a)/m*10 #note the extra factor of 10 for the height of the rectangle
print(nn)
I got a result of 1.683, which is a lot closer to the real answer.
Edit: made a superfluous multiplication, answer revised slightly
As user1362215 points out, your function should be contained in the rectangle. You also get closer to the solution if you increase n. Here is a vectorised solution; the results are in the right range.
# Hit and miss
f <- function(x) x ^ (-0.5)
n <- 1000000
a <- 0.01
b <- 1
#ceiling(max(f((seq(0.01,1,by=0.001)))))
#[1] 10
set.seed(5)
x <- runif(n,a,b)
y <- 10*runif(n,0,1)
R <- sum(y < f(x))/n
(b-a)*10*R
#[1] 1.805701
# Repeat a few times to look at the distribution
set.seed(5)
n <- 100000
r <- replicate(1000,sum(10*runif(n,0,1) < f(runif(n,a,b)))/n *(b-a)*10)
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.755 1.792 1.800 1.800 1.809 1.845
# Sample mean method for comparison
set.seed(5)
r <- replicate(1000, mean(f(runif(n, a,b)))*(b-a))
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.788 1.798 1.800 1.800 1.803 1.813
Re your edit: I am assuming that by x^2 + y^2 on [-1, 1] you are referring to a circle rather than a function f(z), so the task is really to estimate the area of the unit circle (i.e. Pi) by simulation.
f2 <- function(x) sqrt(1-x^2)
s <- seq(-1 , 1 ,by=0.001)
plot(s,f2(s))
# Get the max value of function within the range
c <- ceiling(max(f2(s)))
# [1] 1
n <- 1000000
a <- -1
b <- 1
set.seed(5)
x <- runif(n,a,b)
y <- c*runif(n,0,1)
R <- sum(y < f2(x))/n
(b-a)*c*R
#[1] 1.57063 # multiply it by 2 to get full area
pi/2
#[1] 1.570796
A Monte Carlo alternative to acceptance/rejection is to uniformly generate x values, average the resulting y = f(x) values to estimate the average height, and multiply that by the interval length to get the estimated area. I don't know R well enough, so here it is in Ruby to illustrate the algorithm:
def f(x)
x ** -0.5
end
sum = 0.0
10000.times { sum += f(0.01 + 0.99 * rand) }
print (1.0 - 0.01) * (sum / 10000)
I'm getting results in the range 1.8 +/- 0.02
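For reference, a direct R translation of that Ruby sketch (my own translation, not from the original answer) would be:
# Sample-mean Monte Carlo estimate of the integral of x^(-0.5) over [0.01, 1]
f <- function(x) x^(-0.5)
n <- 10000
x <- runif(n, 0.01, 1)
(1 - 0.01) * mean(f(x))  # interval length times estimated average height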
You can also improve the precision of your estimator by using antithetic random variates - for each x you generate, also use the symmetric x value mirrored about the median of the x's.
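A minimal sketch of the antithetic idea for this integral, assuming uniform draws on [a, b] so the mirror of x is a + b - x:
# Antithetic variates: pair each x with its mirror image and average the two heights
f <- function(x) x^(-0.5)
a <- 0.01; b <- 1; n <- 10000
x <- runif(n, a, b)
(b - a) * mean((f(x) + f(a + b - x)) / 2)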
Using @user20650's code for guidance for how to do this in R, you can estimate Pi / 2 as follows:
f <- function(x) sqrt(1-x^2)
n <- 100000
a <- -1
b <- 1
range <- b-a
set.seed(5)
r <- replicate(1000, mean(f(runif(n,a,b))) * range)
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.566 1.570 1.571 1.571 1.572 1.575
No bounding function is needed for this approach, and generally it yields greater precision than the acceptance/rejection approach.
I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!
subsample=function(x,A) {
subsample.list=sapply(1:A,function(m) {
idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs
You can vectorise this way more (even using plyr), and go much much faster:
library(plyr)
yoursummary = function(x) c(quantile(x,0.025), mean(x), upper=quantile(x,0.975))
subsampleX = function(x, M)
  yoursummary(
    aaply(
      daply(x, .(study), function(d) sample(d$value, M, replace=TRUE), .drop_o=FALSE),
      1, mean
    )
  )
The trick here is to do all the sampling up front: if we want to sample M times, why not do it all while you still have the data split by study.
Original code:
> system.time(subsample(df,20000))
user system elapsed
123.23 0.06 124.74
New vectorised code:
> system.time(subsampleX(df,20000))
user system elapsed
0.24 0.00 0.25
That's about 500x faster.
Here's a base R solution which avoids ddply for speed reasons:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
sample.experiments <- function(df) {
r <- rle(df$study)
samp <- sapply( r$lengths , function(x) sample(seq(x),1) )
start.idx <- c(0,cumsum(r$lengths)[1:(length(r$lengths)-1)] )
df[samp + start.idx,]
}
> sample.experiments(df)
study expt value
1 1 1 0.6113196
4 2 2 0.5026527
6 3 1 0.2803080
7 4 1 0.9824377
Benchmarks
> m <- microbenchmark(
+ ddply(df,.(study),function(i) i[sample(1:nrow(i),1),]) ,
+ sample.experiments(df)
+ )
> m
Unit: microseconds
expr min lq median uq max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2 sample.experiments(df) 337.327 350.734 357.644 365.915 580.097
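To finish the task in the question (10,000 resamples plus summary statistics), this could be wrapped in replicate; a sketch, assuming the summaries wanted are the mean of value and a 95% interval, as in the original subsample() function:
# Repeat the subsampling 10,000 times and summarise the per-iteration means
set.seed(1)
means <- replicate(10000, mean(sample.experiments(df)$value))
c(quantile(means, 0.025), mean = mean(means), upper = quantile(means, 0.975))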