Discarding samples according to a condition - r

I wish to write a loop in R within which Poisson samples are simulated, but I wish to discard any sample that does not contain a zero and "have another go". How may I do this?
For example:
X <- rep(999, 100)
for (j in 1:100) {
  x <- rpois(100, 4)
  X[j] <- mean(x)
}
Is there any way I could discard samples for which length(x[x==0])==0, reselect a sample, and continue until 100 means from samples which do contain zeros are obtained?
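For reference, a minimal rejection-sampling sketch of what is being asked (my own illustration, not taken from the answers below): redraw each sample until it contains at least one zero, then store its mean.
X <- rep(NA_real_, 100)
for (j in 1:100) {
  x <- rpois(100, 4)
  while (!any(x == 0)) {   # "have another go" until the sample contains a zero
    x <- rpois(100, 4)
  }
  X[j] <- mean(x)
}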

As @Frank suggested, a while loop is your best approach, though I don't think an if statement is the best way to go.
NN <- 100   # number of samples (columns)
kk <- 100   # Poisson draws per sample (rows)
lam <- 4
draws <- matrix(rpois(kk * NN, lam), ncol = NN)
# Re-draw any column that contains a zero until every column is zero-free
while (!all(idx <- apply(draws, 2, all))) {
  draws[ , nidx] <- matrix(rpois(sum(nidx <- !idx) * kk, lam), nrow = kk)
}
Then to finish:
colMeans(draws)
An alternative is to use replicate:
colMeans(replicate(NN, {
  draws <- rpois(kk, lam)
  while (!all(draws)) draws <- rpois(kk, lam)
  draws
}))
My quick benchmarks suggest the latter is actually faster.
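A rough sketch of one way the two approaches might be compared (my own sketch, not the benchmark from this answer; timings will vary by machine):
library(microbenchmark)
microbenchmark(
  matrix_redraw = {
    draws <- matrix(rpois(kk * NN, lam), nrow = kk)
    while (!all(idx <- apply(draws, 2, all))) {
      draws[, !idx] <- rpois(sum(!idx) * kk, lam)
    }
    colMeans(draws)
  },
  replicate_redraw = colMeans(replicate(NN, {
    draws <- rpois(kk, lam)
    while (!all(draws)) draws <- rpois(kk, lam)
    draws
  })),
  times = 20
)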
Even more savvy would be to simply eliminate all bad draws from the start (and essentially draw from the truncated distribution).
We know that the probability of getting 0 on a given draw is exp(-lambda), so if we invert uniform draws on (exp(-lambda), 1], we'll be set:
colMeans(matrix(qpois(runif(kk * NN, min = exp(-lam)), lam), ncol = NN))
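As a quick sanity check (my addition, not part of the answer), the inverse-transform draws should never be zero, and their mean should be close to the zero-truncated Poisson mean lam / (1 - exp(-lam)) (about 4.07 for lam = 4):
u <- runif(1e5, min = exp(-lam))
trunc_draws <- qpois(u, lam)
any(trunc_draws == 0)    # expected FALSE
mean(trunc_draws)        # roughly lam / (1 - exp(-lam))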
Also competitive with this is to use data.table:
library(data.table)
grps <- rep(1:NN, each = kk)
data.table(qpois(runif(kk * NN, min = exp(-lam)), lam))[ , mean(V1), grps]

Just to say that I have realised that if I edit Michael's code to:
replicate(NN, {
  draws <- rpois(kk, lam)
  while (all(draws)) draws <- rpois(kk, lam)
  draws
})
It will do what I wish. Thanks to all who answered.

Related

Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?

I am looking at a set of numbers and aiming to split them into subsets via set partitioning. The deciding factor in how these subsets are generated is ensuring that the sum of all the elements in each subset is as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size, and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.
A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the "brute force" approach with this type of problem (as you can see from my code below, which attempts to use folds to solve the problem via "brute force"). This is obviously a completely inefficient way to tackle the problem, so I would like to tackle these minimisation-type problems with a more intelligent approach going forward. Therefore any advice is greatly appreciated.
library(groupdata2)
library(dplyr)
set.seed(345)
j <- runif(500, 0, 10000000)
dist <- c(.3, .2, .1, .05, .065, .185, .1)
s_diff <- 9999999999
for (i in 1:100) {
  x <- fold(j, k = length(dist), method = "n_rand")
  if (abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)])) < abs(s_diff)) {
    s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)]))
    x_fin <- x
  }
}
This is just a simplified version only looking at the first ‘subset’. s_diff would be the smallest difference between the theoretical and actual results simulated, and x_fin would be which subset each element would be in (ie which fold it corresponds to). I was then looking to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.
Thanks in advance!
This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.
The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.
My approach here is to split the problem into three stages:
1. Cut the data into approximately the correct bins as a first approximation.
2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
3. Swap the elements between columns to fine-tune the fit, until the swaps are optimal.
The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.
Let's start with a function to cut the data into approximately the correct bins:
cut_elements <- function(j, dist)
{
  # Specify the sums that we want to achieve in each partition
  partition_sizes <- dist * sum(j)
  # The cumulative partition sizes give us our initial cuts
  partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))
  # Name our partitions according to the given distribution
  levels(partitions) <- levels(cut(seq(0, 1, 0.001), cumsum(c(0, dist))))
  # Return our partitioned data as a data frame
  data.frame(data = j, group = partitions)
}
We want a way to assess how close this approximation (and subsequent approximations) are to our answer. We can plot against the target distribution, but it will also be helpful to have a numerical figure to assess the goodness of fit to include on our plot. Here, I will use the sum of the squares of the differences between the sample bins and the target bins. We'll use the log to make the numbers more comparable. The lower the number, the better the fit.
library(dplyr)
library(ggplot2)
library(tidyr)
compare_to_distribution <- function(df, dist, title = "Comparison")
{
  df %>%
    group_by(group) %>%
    summarise(estimate = sum(data) / sum(j)) %>%
    mutate(group = factor(cumsum(dist))) %>%
    mutate(target = dist) %>%
    pivot_longer(cols = c(estimate, target)) ->
    plot_info

  log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
                     plot_info$value[plot_info$name == "target"])^2))

  ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
    geom_col(position = "dodge") +
    labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}
So now we can do:
cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")
We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:
move_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while (ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while (ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      group_diffs <- group_diffs[ignore_min:ignore_max]
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      swap_size <- (group_diffs[too_big] - group_diffs[too_small]) / 2
      which_big <- which(df$group == names(too_big))
      candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]
      if (df$data[candidate_row] < 2 * swap_size)
      {
        df$group[candidate_row] <- names(too_small)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else
      {
        ignore_min <- ignore_min + 1
      }
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
move_elements(dist) %>%
compare_to_distribution(dist, title = "Cuts and moves")
You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of GOF.
Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:
swap_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while (ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while (ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      current_excess <- group_diffs[too_big]
      current_defic <- group_diffs[too_small]
      current_ss <- current_excess^2 + current_defic^2
      all_pairs <- expand.grid(df$data[df$group == names(too_big)],
                               df$data[df$group == names(too_small)])
      all_pairs$diff <- all_pairs[, 1] - all_pairs[, 2]
      all_pairs$resultant_big <- current_excess - all_pairs$diff
      all_pairs$resultant_small <- current_defic + all_pairs$diff
      all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
      improvements <- which(all_pairs$sum_sq < current_ss)
      if (length(improvements) > 0)
      {
        swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
        r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
        r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
        df$group[r1] <- names(too_small)
        df$group[r2] <- names(too_big)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else ignore_min <- ignore_min + 1
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
df <- cut_elements(j, dist) %>%
  move_elements(dist) %>%
  swap_elements(dist)
df %>% compare_to_distribution(dist, title = "Cuts, moves and swaps")
Pretty close to identical. Let's quantify that:
tapply(df$data, df$group, sum)/sum(j)
# (0,0.3] (0.3,0.5] (0.5,0.6] (0.6,0.65] (0.65,0.715] (0.715,0.9]
# 0.30000025 0.20000011 0.10000014 0.05000010 0.06499946 0.18500025
# (0.9,1]
# 0.09999969
So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:
all(df$data == j)
# [1] TRUE
and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:
partition_to_distribution <- function(data, distribution)
{
  cut_elements(data, distribution) %>%
    move_elements(distribution) %>%
    swap_elements(distribution) %>%
    `[`(, 2)
}
In conclusion, we have created an algorithm that creates an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:
microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
#                                expr      min       lq     mean   median       uq      max neval
#  partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657 93.00016   100
Only 50 milliseconds to fit 500 samples. Seems good enough for most applications. It would grow exponentially with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares of below -30 and would therefore be an exceptionally good match without the fine tuning of swap_elements. These would only take about 30 ms with 10,000 samples.
To add to the excellent answer by @AllanCameron, here is a solution that utilizes the highly efficient function comboGeneral from RcppAlgos*.
library(RcppAlgos)
partDist <- function(v, d, tol_ratio = 0.0001) {
  tot_sum <- d * sum(v)
  orig_len <- length(v)
  tot_len <- d * orig_len
  df <- do.call(rbind, lapply(1L:(length(d) - 1L), function(i) {
    len <- as.integer(tot_len[i])
    vals <- comboGeneral(v, len,
                         constraintFun = "sum",
                         comparisonFun = "==",
                         limitConstraints = tot_sum[i],
                         tolerance = tol_ratio * tot_sum[i],
                         upper = 1)
    ind <- match(vals, v)
    v <<- v[-ind]
    data.frame(data = as.vector(vals), group = rep(paste0("g", i), len))
  }))
  len <- orig_len - nrow(df)
  rbind(df, data.frame(data = v,
                       group = rep(paste0("g", length(d)), len)))
}
The idea is that we find a subset of v (e.g. j in the OP's case) such that the sum is within a tolerance of sum(v) * d[i] for some index i (d is equivalent to dist in the OP's example). After we find a solution (N.B. we put a cap on the number of solutions by setting upper = 1), we assign its elements to a group and remove them from v. We then iterate until we are left with just enough elements in v to be assigned to the last distribution value (i.e. dist[length(dist)]).
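To make the constrained search concrete, here is a toy illustration (values chosen arbitrarily, not from the answer) of the comboGeneral call used above, asking for one 3-element subset whose sum lies within a small tolerance of a target:
library(RcppAlgos)
v_toy <- c(12, 7, 3, 25, 9, 14)
target <- 29
comboGeneral(v_toy, 3,
             constraintFun = "sum",
             comparisonFun = "==",
             limitConstraints = target,
             tolerance = 0.5,   # accept sums within +/- 0.5 of the target
             upper = 1)         # stop at the first qualifying subset
# one row, e.g. 3 12 14, whose sum is 29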
Here is an example using the OP's data:
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
system.time(df_op <- partDist(j, dist, 0.0000001))
user system elapsed
0.019 0.000 0.019
And using the plotting function by @AllanCameron we have:
df_op %>% compare_to_distribution(dist, "RcppAlgos OP Ex")
What about a larger sample with the same distribution:
set.seed(123)
j <- runif(10000,0,10000000)
## N.B. Very small ratio
system.time(df_huge <- partDist(j, dist, 0.000000001))
user system elapsed
0.070 0.000 0.071
The results:
df_huge %>% compare_to_distribution(dist, "RcppAlgos Large Ex")
As you can see, the solution scales very well. We can speed up execution by loosening tol_ratio at the expense of the quality of the result.
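For example (illustrative only; the resulting timing and fit are not reproduced here), a looser run might look like:
system.time(df_loose <- partDist(j, dist, 0.001))
df_loose %>% compare_to_distribution(dist, "RcppAlgos, loose tolerance")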
For reference, with the large data set the solution given by @AllanCameron takes just under 3 seconds and gives a similar log sum of squares value (~ -44):
system.time(allan_large <- partition_to_distribution(j, dist))
user system elapsed
2.261 0.675 2.938
* I am the author of RcppAlgos

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
c <- 1
for (i in no_clm) {
  ctl <- sum(rgamma(i, 30000) - 2000)
  hold[c, 1] <- ctl
  # hold <- rbind(hold, df)
  c <- c + 1
}
Unfortunately this code is quite slow. I've narrowed the bottleneck down to hold[c,1] <- ctl; if I remove that line then the code runs near-instantly.
How can I make this efficient? I need to store the results in some sort of data frame or list in a fast fashion. In reality the actual code is more complex than this, but the slow point is the assignment.
Note that the above is just an example; in reality I have multiple calculations on the rgamma samples, and each of these calculations is then stored in a large data frame.
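As an aside (my own note, not one of the posted answers), much of the cost is the row-wise hold[c, 1] <- assignment itself; keeping the loop but writing into a pre-allocated numeric vector and building the data frame once at the end already avoids most of it:
n <- 1e6
no_clm <- rpois(n, 30)
res <- numeric(n)                 # plain pre-allocated vector
c <- 1
for (i in no_clm) {
  res[c] <- sum(rgamma(i, 30000) - 2000)
  c <- c + 1
}
hold <- data.frame(x = res)       # build the data frame once at the end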
Try this:
hold <- data.frame(sapply(no_clm, function(x) {
  return(sum(rgamma(x, 30000) - 2000))
}))
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can just iterate over that list and sum.
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
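As a further aside (my own note, untimed here), base R's rowsum() can do the same grouped sum without the split()/sapply() step, using the gammas and no_clm objects from the answer above:
grp <- rep(seq_along(no_clm), times = no_clm)   # group id for each gamma draw
hold$x <- rowsum(gammas, grp)[, 1]              # grouped sums, in group-id order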
Move the calculation into an sapply() loop instead of a for loop, and then note that 2000 * no_clm can be moved outside the loop (to minimise the number of function calls):
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000)-2000))]

Efficient way to generate permutations of 0 and 1?

What I am trying to do is generate all possible permutations of 1 and 0 given a particular sample size. For instance, with a sample of n=8 I would like the m = 2^8 = 256 possible permutations.
I've written a function in R to do this, but after n=11 it takes a very long time to run. I would prefer a solution in R, but if it's in another programming language I can probably figure it out. Thanks!
PermBinary <- function(n) {
  n.perms <- 2^n
  array <- matrix(0, nrow = n, ncol = n.perms)
  # array <- big.matrix(n, n.perms, type = 'integer', init = -5)
  for (i in 1:n) {
    div.length <- ncol(array) / (2^i)
    div.num <- ncol(array) / div.length
    end <- 0
    while (end != ncol(array)) {
      end <- end + 1
      start <- end + div.length
      end <- start + div.length - 1
      array[i, start:end] <- 1
    }
  }
  return(array)
}
expand.grid is probably the best vehicle to get what you want.
For example if you wanted a sample size of 3 we could do something like
expand.grid(0:1, 0:1, 0:1)
For a sample size of 4
expand.grid(0:1, 0:1, 0:1, 0:1)
So what we want to do is find a way to automate that call.
If we had a list of the inputs we want to give to expand.grid we could use do.call to construct the call for us. For example
vals <- 0:1
tmp <- list(vals, vals, vals)
do.call(expand.grid, tmp)
So now the challenge is to automatically make the "tmp" list above in a fashion that we can dictate how many copies of "vals" we want. There are lots of ways to do this but one way is to use replicate. Since we want a list we'll need to tell it to not simplify the result or else we will get a matrix/array as the result.
vals <- 0:1
tmp <- replicate(4, vals, simplify = FALSE)
do.call(expand.grid, tmp)
Alternatively we can use rep on a list input (which I believe is faster because it doesn't have as much overhead as replicate but I haven't tested it)
tmp <- rep(list(vals), 4)
do.call(expand.grid, tmp)
Now wrap that up into a function to get:
binarypermutations <- function(n, vals = 0:1) {
  tmp <- rep(list(vals), n)
  do.call(expand.grid, tmp)
}
Then call with the sample size like so binarypermutations(5).
This gives a data.frame of dimensions 2^n x n as a result - transpose and convert to a different data type if you'd like.
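For instance (an illustrative aside, not part of the original answer), to recover an n x 2^n 0/1 matrix like the one the OP's PermBinary() builds (same set of columns, possibly in a different order):
m <- t(as.matrix(binarypermutations(3)))   # transpose the 2^n x n data.frame
dim(m)                                     # 3 rows, 8 columns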
The answer above may be better since it uses base - my first thought was to use data.table's CJ function:
library(data.table)
do.call(CJ, replicate(8, c(0, 1), FALSE))
It will be slightly faster (~15%) than expand.grid, so it will only be more valuable for extreme cases.
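A quick way to check that claim on your own machine (illustrative sketch only; the exact ratio will vary):
library(data.table)
library(microbenchmark)
microbenchmark(
  expand_grid = do.call(expand.grid, rep(list(0:1), 15)),
  CJ          = do.call(CJ, replicate(15, c(0, 1), simplify = FALSE)),
  times = 10
)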

Optimizing a vectorized function using apply, compiler, or other techniques

I'm seeking to optimize this algorithm, smartWindow, and the surrounding process (my original post, which explains some of the context around the function and how I got here: Vectorizing a loop through lines of data frame R while accessing multiple variables of the dataframe).
This currently takes 240 seconds to run on my actual data. I've tried some Rprof profiling, and it seems that the chg2 <- line of smartWindow is eating the most time. I've also tried byte-compiling the function with cmpfun. I'm wondering if there's a way to significantly improve the speed of what I'm trying to do.
What I'm really looking for is a technique that accomplishes what I've done below in something closer to 20 seconds than 240 seconds. I've shaved off 1-5% of the computation time using various tweaks, but what I'm really wondering is whether I can decrease the time by a factor greater than 2.
## the function
smartWindow <- function(tdate, aid, chgdf, datev = 'Submit.Date', assetv = 'Asset.ID',
                        fdays = 30, bdays = 30) {
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  chg2 <- chgdf[chgdf[, assetv] == aid & chgdf[, datev] < fdays & chgdf[, datev] > bdays, ]
  ret <- nrow(chg2)
  return(ret)
}
## set up some data #################################################
dates <- seq(as.Date('2011-01-01'), as.Date('2013-12-31'), by = 'days')
aids <- paste(rep(letters[1:26], 3), 1:3, sep = '')
n <- 3000
inc <- data.frame(
  Submit.Date = sample(dates, n, replace = TRUE),
  Asset.ID = sample(aids, n, replace = TRUE))
chg <- data.frame(
  Submit.Date = sample(dates, n, replace = TRUE),
  Asset.ID = sample(aids, n, replace = TRUE))
## applying the function to just one incident #######################
smartWindow(inc$Submit.Date[1], inc$Asset.ID[1], chgdf = chg, bdays = 100)
## applying it to every incident... this is the process I seek to optimize #########
system.time({
  inc$chg_b30 <- apply(inc[, c('Submit.Date', 'Asset.ID')], 1, function(row)
    smartWindow(as.Date(row[1]), row[2], chgdf = chg,
                datev = 'Submit.Date', assetv = 'Asset.ID', bdays = 30, fdays = 0))
})
table(inc$chg_b30)
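No answer is reproduced here for this question. As an illustrative aside only (a sketch, not a posted answer), one common way to cut this kind of per-row subsetting cost is to split the change dates by asset once, so each incident only scans the dates for its own asset; using the inc/chg objects above and the bdays = 30, fdays = 0 window:
chg_by_asset <- split(chg$Submit.Date, chg$Asset.ID)
inc$chg_b30_alt <- mapply(function(tdate, aid) {
  d <- chg_by_asset[[aid]]            # only this asset's change dates
  sum(d < tdate & d > tdate - 30)     # count changes inside the window
}, inc$Submit.Date, as.character(inc$Asset.ID))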

How can I speed up this sapply for cross checking samples?

I'm trying to speed up a QC function for checking similarity between samples. I wanted to know if there is a faster way to compare the way I am doing below? I know there have been answers to this kind of question that are pretty definitive (on SO or otherwise) but I can't find them. I know I should investigate plyr but I'm still getting a hold of sapply.
The following sample data is representative of the output I would be working with; it has been randomized, but I don't think that affects the application to my original question.
## sample data
nSamples <- 1000
nSamplesQC <- 100
nAssays <- 96
microarrayScores <- matrix(sample(c("G:G", "T:G", "T:T", NA), nSamples * nAssays, replace = TRUE),
                           nrow = nSamples, ncol = nAssays)
microarrayScoresQC <- matrix(sample(c("G:G", "T:G", "T:T", NA), nSamples * nAssays, replace = TRUE),
                             nrow = nSamples, ncol = nAssays)
mycombs <- data.frame(Experiment = rep(1:nSamples, nSamplesQC),
                      QC = sort(rep(1:nSamplesQC, nSamples)))
## testing function
system.time(
  sapply(seq(length(mycombs[, 1])), function(x) {
    compare <- microarrayScores[mycombs[x, 1], ] == microarrayScoresQC[mycombs[x, 2], ]
    sum(compare[!is.na(compare)]) / sum(!is.na(compare))
  })
)
Here is a vectorized version of your code, about 20 times faster on my machine:
rowMeans(microarrayScores[mycombs[,1], ] ==
microarrayScoresQC[mycombs[,2], ], na.rm = TRUE)
Something like this:
foo <- function(x) {
  compare <- microarrayScores[x[1], ] == microarrayScoresQC[x[2], ]
  sum(compare[!is.na(compare)]) / sum(!is.na(compare))
}
system.time(apply(mycombs, 1, foo))
appears to be modestly faster (maybe 2-3x).
