Generate a random subset of the powerset directly - r

It is easy to generate a random subset of the powerset if we are able to compute all elements of the powerset first and then randomly draw a sample out of it:
set.seed(12)
x = 1:4
n.samples = 3
library(HapEstXXR)
power.set = HapEstXXR::powerset(x)
sample(power.set, size = n.samples, replace = FALSE)
# [[1]]
# [1] 2
#
# [[2]]
# [1] 3 4
#
# [[3]]
# [1] 1 3 4
However, if the length of x is large, there will be too many elements for the powerset. I am therefore looking for a way to directly compute a random subset.
One possibility is to first draw a "random length" and then draw random subset of x using the "random length":
len = sample(1:length(x), size = n.samples, replace = TRUE)
len
# [1] 2 1 1
lapply(len, function(l) sort(sample(x, size = l)))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 1
This, however, generates duplicates. Of course, I could now remove the duplicates and repeat the previous sampling using a while loop until I end up with n.samples non-duplicate random subsets of the powerset:
drawSubsetOfPowerset = function(x, n) {
ret = list()
while(length(ret) < n) {
# draw a "random length" with some meaningful prob to reduce number of loops
len = sample(0:n, size = n, replace = TRUE, prob = choose(n, 0:n)/2^n)
# draw random subset of x using the "random length" and sort it to better identify duplicates
random.subset = lapply(len, function(l) sort(sample(x, size = l)))
# remove duplicates
ret = unique(c(ret, random.subset))
}
return(ret)
}
drawSubsetOfPowerset(x, n.samples)
Of course, I could now try to optimize several components of my drawSubsetOfPowerset function, e.g. (1) trying to avoid the copying of the object ret in each iteration of the loop, (2) using a faster sort, (3) using a faster way to remove duplicates of the list, ...
My question is: Is there maybe a different way (which is more efficient) of doing this?

How about using binary representation? This way we can generate a random subset of integers from the length of the total number of power sets given by 2^length(v). From there we can make use of intToBits along with indexing to guarantee we generate random unique subsets of the power set in an ordered fashion.
randomSubsetOfPowSet <- function(v, n, mySeed) {
set.seed(mySeed)
lapply(sample(2^length(v), n) - 1, function(x) v[intToBits(x) > 0])
}
Taking x = 1:4, n.samples = 5, and a random seed of 42, we have:
randomSubsetOfPowSet(1:4, 5, 42)
[[1]]
[1] 2 3 4
[[2]]
[1] 1 2 3 4
[[3]]
[1] 3
[[4]]
[1] 2 4
[[5]]
[1] 1 2 3
Explanation
What does binary representation have to do with power sets?
It turns out that given a set, we can find all subsets by turning to bits (yes, 0s and 1s). By viewing the elements in a subset as on elements in the original set and the elements not in that subset as off, we now have a very tangible way of thinking about how to generate each subset. Observe:
Original set: {a, b, c, d}
| | | |
V V V V b & d
Existence in subset: 1/0 1/0 1/0 1/0 are on
/ \
/ \
| |
V V
Example subset: {b, d} gets mapped to {0, 1, 0, 1}
| \ \ \_______
| | \__ \
| |___ \____ \____
| | | |
V V V V
Thus, {b, d} is mapped to the integer 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3 = 10
This is now a problem of combinations of bits of length n. If you map this out for every subset of A = {a, b, c, d}, you will obtain 0:15. Therefore, to obtain a random subset of the power set of A, we simply generate a random subset of 0:15 and map each integer to a subset of A. How might we do this?
sample comes to mind.
Now, it is very easy to go the other way as well (i.e. from an integer to a subset of our original set)
Observe:
Given the integer 10 and set A given above (i.e. {a, b, c, d}) we have:
10 in bits is -->> {0, 1, 0, 1}
Which indices are greater than 0?
Answer: 2 and 4
Taking the 2nd the 4th element of our set gives: {b, d} et Voila!

Related

Generating a Random Permutation in R

I try to implement a example using R in Simulation (2006, 4ed., Elsevier) by Sheldon M. Ross, which wants to generate a random permutation and reads as follows:
Suppose we are interested in generating a permutation of the numbers 1,2,... ,n
which is such that all n! possible orderings are equally likely.
The following algorithm will accomplish this by
first choosing one of the numbers 1,2,... ,n at random;
and then putting that number in position n;
it then chooses at random one of the remaining n-1 numbers and puts that number in position n-1 ;
it then chooses at random one of the remaining n-2 numbers and puts it in position n-2 ;
and so on
Surely, we can achieve a random permutation of the numbers 1,2,... ,n easily by
sample(1:n, replace=FALSE)
For example
> set.seed(0); sample(1:5, replace=FALSE)
[1] 1 4 3 5 2
However, I want to get similar results manually according to the above algorithmic steps. Then I try
## write the function
my_perm = function(n){
x = 1:n # initialize
k = n # position n
out = NULL
while(k>0){
y = sample(x, size=1) # choose one of the numbers at random
out = c(y,out) # put the number in position
x = setdiff(x,out) # the remaining numbers
k = k-1 # and so on
}
out
}
## test the function
n = 5; set.seed(0); my_perm(n) # set.seed for reproducible
and have
[1] 2 2 4 5 1
which is obviously incorrect for there are two 2 . How can I fix the problem?
You have implemented the logic correctly but there is only one thing that you need to be aware which is related to R.
From ?sample
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x
So when the last number is remaining in x, let's say that number is 4, sampling would take place from 1:4 and return any 1 number from it.
For example,
set.seed(0)
sample(4, 1)
#[1] 2
So you need to adjust your function for that after which the code should work correctly.
my_perm = function(n){
x = 1:n # initialize
k = n # position n
out = NULL
while(k>1){ #Stop the while loop when k = 1
y = sample(x, size=1) # choose one of the numbers at random
out = c(y,out) # put the number in position
x = setdiff(x,out) # the remaining numbers
k = k-1 # and so on
}
out <- c(x, out) #Add the last number in the output vector.
out
}
## test the function
n = 5
set.seed(0)
my_perm(n)
#[1] 3 2 4 5 1
Sample size should longer than 1. You can break it by writing a condition ;
my_perm = function(n){
x = 1:n
k = n
out = NULL
while(k>0){
if(length(x)>1){
y = sample(x, size=1)
}else{
y = x
}
out = c(y,out)
x = setdiff(x,out)
k = k-1
}
out
}
n = 5; set.seed(0); my_perm(n)
[1] 3 2 4 5 1

Generate non-negative (or positive) random integers that sum to a fixed value

I would like to randomly assign positive integers to G groups, such that they sum up to V.
For example, if G = 3 and V = 21, valid results may be (7, 7, 7), (10, 6, 5), etc.
Is there a straightforward way to do this?
Editor's notice (from 李哲源):
If values are not restricted to integers, the problem is simple and has been addressed in Choosing n numbers with fixed sum.
For integers, there is a previous Q & A: Generate N random integers that sum to M in R but it appears more complicated and is hard to follow. The loop based solution over there is also not satisfying.
non-negative integers
Let n be sample size:
x <- rmultinom(n, V, rep.int(1 / G, G))
is a G x n matrix, where each column is a multinomial sample that sums up to V.
By passing rep.int(1 / G, G) to argument prob I assume that each group has equal probability of "success".
positive integers
As Gregor mentions, a multinomial sample can contain 0. If such samples are undesired, they should be rejected. As a result, we sample from a truncated multinomial distribution.
In How to generate target number of samples from a distribution under a rejection criterion I suggested an "over-sampling" approach to achieve "vectorization" for a truncated sampling. Simply put, Knowing the acceptance probability we can estimate the expected number of trials M to see the first "success" (non-zero). We first sample say 1.25 * M samples, then there will be at least one "success" in these samples. We randomly return one as the output.
The following function implements this idea to generate truncated multinomial samples without 0.
positive_rmultinom <- function (n, V, prob) {
## input validation
G <- length(prob)
if (G > V) stop("'G > V' causes 0 in a sample for sure!")
if (any(prob < 0)) stop("'prob' can not contain negative values!")
## normalization
sum_prob <- sum(prob)
if (sum_prob != 1) prob <- prob / sum_prob
## minimal probability
min_prob <- min(prob)
## expected number of trials to get a "success" on the group with min_prob
M <- round(1.25 * 1 / min_prob)
## sampling
N <- n * M
x <- rmultinom(N, V, prob)
keep <- which(colSums(x == 0) == 0)
x[, sample(keep, n)]
}
Now let's try
V <- 76
prob <- c(53, 13, 9, 1)
Directly using rmultinom to draw samples can occasionally result in ones with 0:
## number of samples that contain 0 in 1000 trials
sum(colSums(rmultinom(1000, V, prob) == 0) > 0)
#[1] 355 ## or some other value greater than 0
But there is no such issue by using positive_rmultinom:
## number of samples that contain 0 in 1000 trials
sum(colSums(positive_rmultinom(1000, V, prob) == 0) > 0)
#[1] 0
Probably a less expensive way, but this seems to work.
G <- 3
V <- 21
m <- data.frame(matrix(rep(1:V,G),V,G))
tmp <- expand.grid(m) # all possibilities
out <- tmp[which(rowSums(tmp) == V),] # pluck those that sum to 'V'
out[sample(1:nrow(out),1),] # randomly select a column
Not sure how to do with runif
I figured out what I believe to be a much simpler solution. You first generate random integers from your minimum to maximum range, count them up and then make a vector of the counts (including zeros).
Note that this solution may include zeros even if the minimum value is greater than zero.
Hope this helps future r people with this problem :)
rand.vect.with.total <- function(min, max, total) {
# generate random numbers
x <- sample(min:max, total, replace=TRUE)
# count numbers
sum.x <- table(x)
# convert count to index position
out = vector()
for (i in 1:length(min:max)) {
out[i] <- sum.x[as.character(i)]
}
out[is.na(out)] <- 0
return(out)
}
rand.vect.with.total(0, 3, 5)
# [1] 3 1 1 0
rand.vect.with.total(1, 5, 10)
#[1] 4 1 3 0 2
Note, I also posted this here Generate N random integers that sum to M in R, but this answer is relevant to both questions.

Is there a general algorithm to identify a numeric series?

I am looking for a general purpose algorithm to identify short numeric series from lists with a max length of a few hundred numbers. This will be used to identify series of masses from mass spectrometry (ms1) data.
For instance, given the following list, I would like to identify that 3 of these numbers fit the series N + 1, N +2, etc.
426.24 <= N
427.24 <= N + 1/x
371.10
428.24 <= N + 2/x
851.47
451.16
The series are all of the format: N, N+1/x, N+2/x, N+3/x, N+4/x, etc, where x is an integer (in the example x=1). I think this constraint makes the problem very tractable. Any suggestions for a quick/efficient way to tackle this in R?
This routine will generate series using x from 1 to 10 (you could increase it). And will check how many are contained in the original list of numbers.
N = c(426.24,427.24,371.1,428.24,851.24,451.16)
N0 = N[1]
x = list(1,2,3,4,5,6,7,8,9,10)
L = 20
Series = lapply(x, function(x){seq(from = N0, by = 1/x,length.out = L)})
countCoincidences = lapply(Series, function(x){sum(x %in% N)})
Result:
unlist(countCoincidences)
[1] 3 3 3 3 3 3 3 3 3 2
As you can see, using x = 1 will have 3 coincidences. The same goes for all x until x=9. Here you have to decide which x is the one you want.
Since you're looking for an arithmetic sequence, the difference k is constant. Thus, you can loop over the vector and subtract each value from the sequence. If you have a sequence, subtracting the second term from the vector will result in values of -k, 0, and k, so you can find the sequence by looking for matches between vector - value and its opposite, value - vector:
x <- c(426.24, 427.24, 371.1, 428.24, 851.47, 451.16)
unique(lapply(x, function(y){
s <- (x - y) %in% (y - x);
if(sum(s) > 1){x[s]}
}))
# [[1]]
# NULL
#
# [[2]]
# [1] 426.24 427.24 428.24

fill up a matrix one random cell at a time

I am filling a 10x10 martix (mat) randomly until sum(mat) == 100
I wrote the following.... (i = 2 for another reason not specified here but i kept it at 2 to be consistent with my actual code)
mat <- matrix(rep(0, 100), nrow = 10)
mat[1,] <- c(0,0,0,0,0,0,0,0,0,1)
mat[2,] <- c(0,0,0,0,0,0,0,0,1,0)
mat[3,] <- c(0,0,0,0,0,0,0,1,0,0)
mat[4,] <- c(0,0,0,0,0,0,1,0,0,0)
mat[5,] <- c(0,0,0,0,0,1,0,0,0,0)
mat[6,] <- c(0,0,0,0,1,0,0,0,0,0)
mat[7,] <- c(0,0,0,1,0,0,0,0,0,0)
mat[8,] <- c(0,0,1,0,0,0,0,0,0,0)
mat[9,] <- c(0,1,0,0,0,0,0,0,0,0)
mat[10,] <- c(1,0,0,0,0,0,0,0,0,0)
i <- 2
set.seed(129)
while( sum(mat) < 100 ) {
# pick random cell
rnum <- sample( which(mat < 1), 1 )
mat[rnum] <- 1
##
print(paste0("i =", i))
print(paste0("rnum =", rnum))
print(sum(mat))
i = i + 1
}
For some reason when sum(mat) == 99 there are several steps extra...I would assume that once i = 91 the while would stop but it continues past this. Can somone explain what I have done wrong...
If I change the while condition to
while( sum(mat) < 100 & length(which(mat < 1)) > 0 )
the issue remains..
Your problem is equivalent to randomly ordering the indices of a matrix that are equal to 0. You can do this in one line with sample(which(mat < 1)). I suppose if you wanted to get exactly the same sort of output, you might try something like:
set.seed(144)
idx <- sample(which(mat < 1))
for (i in seq_along(idx)) {
print(paste0("i =", i))
print(paste0("rnum =", idx[i]))
print(sum(mat)+i)
}
# [1] "i =1"
# [1] "rnum =5"
# [1] 11
# [1] "i =2"
# [1] "rnum =70"
# [1] 12
# ...
See ?sample
Arguments:
x: Either a vector of one or more elements from which to choose,
or a positive integer. See ‘Details.’
...
If ‘x’ has length 1, is numeric (in the sense of ‘is.numeric’) and
‘x >= 1’, sampling _via_ ‘sample’ takes place from ‘1:x’. _Note_
that this convenience feature may lead to undesired behaviour when
‘x’ is of varying length in calls such as ‘sample(x)’. See the
examples.
In other words, if x in sample(x) is of length 1, sample returns a random number from 1:x. This happens towards the end of your loop, where there is just one 0 left in your matrix and one index is returned by which(mat < 1).
The iteration repeats on level 99 because sample() behaves very differently when the first parameter is a vector of length 1 and when it is greater than 1. When it is length 1, it assumes you a random number from 1 to that number. When it has length >1, then you get a random number from that vector.
Compare
sample(c(99,100),1)
and
sample(c(100),1)
Of course, this is an inefficient way of filling your matrix. As #josilber pointed out, a single call to sample could do everything you need.
The issue comes from how sample and which do the sampling when you have only a single '0' value left.
For example, do this:
mat <- matrix(rep(1, 100), nrow = 10)
Now you have a matrix of all 1's. Now lets make two numbers 0:
mat[15]<-0
mat[18]<-0
and then sample
sample(which(mat<1))
[1] 18 15
by adding a size=1 argument you get one or the other
now lets try this:
mat[18]<-1
sample(which(mat<1))
[1] 3 13 8 2 4 14 11 9 10 5 15 7 1 12 6
Oops, you did not get [1] 15 . Instead what happens in only a single integer (15 in this case) is passed tosample. When you do sample(x) and x is an integer, it gives you a sample from 1:x with the integers in random order.

Dictionaries and pairs

In R I was wondering if I could have a dictionary (in a sense like python) where I have a pair (i, j) as the key with a corresponding integer value. I have not seen a clean or intuitive way to construct this in R. A visual of my dictionary would be:
(1, 2) --> 1
(1, 3) --> 3
(1, 4) --> 4
(1, 5) --> 3
EDIT: The line of code to insert these key value pairs is in a loop with counters i and j. For example suppose I have:
for(i in 1: 5)
{
for(j in 2: 4)
{
maps[i][j] = which.min(someVector)
}
}
How do I change maps[i][j] to get the functionality I am looking for?
Both named lists and environments provide a mapping, but between a character string (the name, which acts as a key) and an arbitrary R object (the value). If your i and j are simple (as they appear in your example to be, being integers), you can easily make a unique string out of the pair of them by concatenating them with some delimiter. This would make a valid name/key
mykey <- function(i, j) {
paste(i, j, sep="|")
}
maps <- list()
for(i in 1: 5) {
for(j in 2: 4) {
maps[mykey(i,j)] <- which.min(someVector)
}
}
You can extract any value for a specific i and j with
maps[[mykey(i,j)]]
Here's how you'd do this in a data.table, which will be fast and easily extendable:
library(data.table)
d = data.table(i = 1, j = 2:5, value = c(1,3,4,3), key = c("i", "j"))
# access the i=1,j=3 value
d[J(1, 3)][, value]
# change that value
d[J(1, 3), value := 12]
# do some vector assignment (you should stop thinking loops, and start thinking vectors)
d[, value := i * j]
etc.
You can do this with a list of vectors.
maps <- lapply(vector('list',5), function(i) integer(0))
maps[[1]][2] <- 1
maps[[1]][3] <- 3
maps[[1]][4] <- 4
maps[[1]][5] <- 3
That said, there's probably a better way to do what you're trying to do, but you haven't given us enough background.
You can just use a data.frame:
a = data.frame(spam = c("alpha", "gamma", "beta"),
shrub = c("primus","inter","parus"),
stringsAsFactors = FALSE)
rownames(a) = c("John", "Eli","Seth")
> a
spam shrub
John alpha primus
Eli gamma inter
Seth beta parus
> a["John", "spam"]
[1] "alpha"
This handles the case with a 2d dictionary style object with named keys. The keys can also be integers, although they might have to be characters in stead of numeric's.
R matrices allow you to do this. There are both sparse and dense version. I beleive the tm-package uses a variation on sparse matrices to form its implementation of dictionaries. This shows how to extract the [i,j] elements of matrix M where [i,j] is represented as a a two-column matrix.
M<- matrix(1:20, 5, 5)
ij <- cbind(sample(1:5), sample(1:5) )
> ij
[,1] [,2]
[1,] 4 4
[2,] 1 2
[3,] 5 3
[4,] 3 1
[5,] 2 5
> M[ij]
[1] 19 6 15 3 2
#Justin also points out that you could use lists which can be indexed by position:
L <- list( as.list(letters[1:5] ), as.list( paste(2,letters[1:5] ) ) )
> L[[1]][[2]]
[1] "b"
> L[[2]][[2]]
[1] "2 b"

Resources