Say I have a vector of random numbers. I can order them from lowest to highest:
set.seed(1)
x <- runif(20)
v <- x[order(x)]
Now, say I want to order them but with some degree of noise.
I can randomly move elements like this:
z <-sample(1:20,2)
replace(v, z, v[rev(z)])
but this doesn't necessarily move closely related values. I could be equally likely to randomly switch the 1st and 20th values as the 5th and 6th. I would like to have some control over the switching, so I can switch more closely related values.
Ideally, I would be able to reorder the vector to have a specific Spearman's correlation. Say rather than the Spearman correlation of rank order being 1 when they are perfectly ordered, is there a way to reorder that same vector of numbers to have e.g. a Spearman's correlation of 0.5 ?
What if you added some noise to their rankings? This makes sure values don't get moved too far away from their starting point. For example:
set.seed(1)
N <- 50
D <- 3 # controls how far things can move
x <- runif(N)
v <- x[vx <- order(rank(x) + runif(N, -D, D))]
z <- x[order(x)]
layout(matrix(c(1,3,2,3), nrow=2))
plot(z, main ="Ordered") # z is the fully sorted vector
plot(v, main ="Mixed") # v is sorted on the noisy ranks
plot(z, v, xlab="ordered", ylab="mixed"); abline(0,1)
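To get a feel for how D translates into a Spearman correlation, one could sweep a few values of D and measure the correlation with the sorted vector. This is only a sketch: noisy_rho is a made-up helper name, and the correlations it returns depend on N and the seed, so D would have to be tuned by trial rather than read off a formula.

```r
set.seed(1)
N <- 50
x <- runif(N)
z <- sort(x)   # fully ordered reference

# hypothetical helper: Spearman correlation between the sorted vector
# and the vector ordered on noisy ranks, for a given noise width D
noisy_rho <- function(D) {
  v <- x[order(rank(x) + runif(N, -D, D))]
  cor(z, v, method = "spearman")
}

# D = 0 adds no noise, so the correlation is exactly 1; larger D lowers it
sapply(c(0, 1, 3, 10, 50), noisy_rho)
```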
I don't think I have completely understood your question, but here's a start. I am simply swapping random consecutive values of the sorted vector in a loop. You can control the amount of swapping with the n_swaps argument.
noisy_sort <- function(x, n_swaps) {
sorted_x <- sort(x)
indices <- sample(seq_along(x[-1]), n_swaps)
for(i in indices) {
sorted_x[c(i, i+1)] <- sorted_x[c(i+1, i)]
}
sorted_x
}
set.seed(1)
x <- runif(20)
result <- noisy_sort(x, 3)
order(result)
[1] 1 2 3 5 4 6 7 8 9 10 11 13 12 14 15 16 17 19 18 20
(the swapped pairs are at positions 4-5, 12-13 and 18-19)
Here is a very rudimentary algo.
Using the formula for Spearman correlation with distinct ranks, you can back out the desired sum of squared differences (SSE) between ranks. Then, in the spirit of a Markov Chain Monte Carlo (MCMC) approach, you sample a pair of indices to swap and move to the new vector with swapped elements if that brings the SSE closer to the desired value.
I used the number of iterations as the stopping criterion. You can change the condition so that the loop stops once a target tolerance is met.
set.seed(1)
n <- 20
x <- runif(n)
v <- sort(x)
calc_exp_sse <- function(rho, N) {
(1 - rho) * N * (N^2 - 1) / 6
}
exp_sse <- calc_exp_sse(0.5, n)
ord <- 1:n
vec <- ord
for (i in 1:1000) {
swap <- vec
swid <- sample(n, 2L)
swap[swid] <- swap[c(swid[2L], swid[1L])]
if (abs(exp_sse - sum((ord-swap)^2)) < abs(exp_sse - sum((ord-vec)^2))) {
vec <- swap
}
}
vec
cor(vec, ord, method="spearman")
#[1] 0.5007519
cor(v, v[vec], method="spearman")
#[1] 0.5007519
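The tolerance-based stopping rule mentioned above could look roughly like the following. This is a sketch under assumptions, not a tuned implementation: reorder_to_rho, tol and max_iter are names invented here, the tolerance is expressed in correlation units, and the values are assumed to be distinct (no ties).

```r
# hypothetical helper: reorder v so that its Spearman correlation with
# sort(v) is within tol of rho, or give up after max_iter proposals
reorder_to_rho <- function(v, rho, tol = 0.01, max_iter = 1e5) {
  n <- length(v)
  ord <- seq_len(n)
  scale <- n * (n^2 - 1) / 6
  target <- (1 - rho) * scale   # desired sum of squared rank differences
  vec <- ord
  cur <- 0                      # SSE of the current permutation
  for (iter in seq_len(max_iter)) {
    # |cur - target| / scale equals the gap between current and target rho
    if (abs(cur - target) <= tol * scale) break
    swid <- sample(n, 2L)
    swap <- vec
    swap[swid] <- swap[rev(swid)]
    new <- sum((ord - swap)^2)
    # accept the swap only if it moves the SSE closer to the target
    if (abs(target - new) < abs(target - cur)) {
      vec <- swap
      cur <- new
    }
  }
  sort(v)[vec]
}

set.seed(1)
v <- runif(20)
w <- reorder_to_rho(v, 0.5)
cor(sort(v), w, method = "spearman")
```

Because only improving swaps are accepted, the search can in principle stall before reaching the tolerance, which is why max_iter is kept as a safety net.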
I am dealing with a variation of the well-known subset sum problem, and I really need some help with it. In my problem, I have a matrix m with two columns (a, b) and n rows. I want to find the rows for which the sums of the corresponding a and b values equal two target values (a_target, b_target). The constraints are that a_target, b_target, a and b are all positive integers, and that I am only interested in the first solution that meets both targets or, if no solution does, the closest one. Closest can be defined by the sum of the errors across the two targets. As this method will be run on large datasets, I need the solution to be optimised.
The problem could be set up as follows:
m <- matrix(data=sample(1:100, 200, replace=T),
ncol=2,
dimnames = list(
NULL,c("a","b")
))
head(m)
a b
[1,] 44 80
[2,] 51 24
[3,] 31 68
[4,] 46 55
[5,] 34 98
[6,] 93 49
a_target <- 500
b_target <- 700
To give some background, the ordinary subset sum problem deals with finding any subset of a set of integers that sums to some target t, which is NP-complete. There are multiple methods to do this with varying time optimisations. One such package in R is subsetsum (see its documentation). I have taken code from this package with the aim of modifying it for my problem, but I'm not sure whether that is possible. For instance, this solution requires t to be in increasing order to work, so I'm not sure how applicable it would be with two t values. The code for the single-column case, i.e. a vector, is:
subsetsum <- function(S, t) {
n <- length(S)
inds <- NULL
x <- logical(n)
F <- numeric(t + 1)
G <- logical(t + 1)
G[1] <- TRUE
print(paste("n,inds,x,F,G",n,inds,x,F,G))
for (k in 1:n) {
H <- c(logical(S[k]), G[1:(t + 1 - S[k])])
H <- (G < H)
j <- which(H)
F[j] <- k
G[j] <- TRUE
if (G[t + 1]) break
}
wch <- which(G)
j <- wch[length(wch)]
fmax <- j - 1
while (j > 1) {
k <- F[j]
x[k] <- TRUE
j <- j - S[k]
}
inds <- which(x)
return(list(val = sum(S[inds]), inds = inds))
}
I would like to randomly assign positive integers to G groups, such that they sum up to V.
For example, if G = 3 and V = 21, valid results may be (7, 7, 7), (10, 6, 5), etc.
Is there a straightforward way to do this?
Editor's notice (from 李哲源):
If values are not restricted to integers, the problem is simple and has been addressed in Choosing n numbers with fixed sum.
For integers, there is a previous Q & A: Generate N random integers that sum to M in R but it appears more complicated and is hard to follow. The loop based solution over there is also not satisfying.
non-negative integers
Let n be sample size:
x <- rmultinom(n, V, rep.int(1 / G, G))
is a G x n matrix, where each column is a multinomial sample that sums up to V.
By passing rep.int(1 / G, G) to argument prob I assume that each group has equal probability of "success".
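As a quick check with the numbers from the question (G = 3, V = 21), something like the following shows the shape of the result and that every column sums to V. The choice of n = 5 and the seed are arbitrary.

```r
set.seed(42)
G <- 3
V <- 21
n <- 5

# each column is one multinomial sample of G non-negative counts
x <- rmultinom(n, V, rep.int(1 / G, G))

dim(x)       # G rows, n columns
colSums(x)   # every entry equals V
```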
positive integers
As Gregor mentions, a multinomial sample can contain 0. If such samples are undesired, they should be rejected. As a result, we sample from a truncated multinomial distribution.
In How to generate target number of samples from a distribution under a rejection criterion I suggested an "over-sampling" approach to achieve "vectorization" for truncated sampling. Simply put, knowing the acceptance probability, we can estimate the expected number of trials M needed to see the first "success" (a sample without zeros). We first draw, say, 1.25 * M samples, so that there will very likely be at least one "success" among them. We randomly return one as the output.
The following function implements this idea to generate truncated multinomial samples without 0.
positive_rmultinom <- function (n, V, prob) {
  ## input validation
  G <- length(prob)
  if (G > V) stop("'G > V' causes 0 in a sample for sure!")
  if (any(prob < 0)) stop("'prob' can not contain negative values!")
  ## normalization
  sum_prob <- sum(prob)
  if (sum_prob != 1) prob <- prob / sum_prob
  ## minimal probability
  min_prob <- min(prob)
  ## expected number of trials to get a "success" on the group with min_prob
  M <- round(1.25 * 1 / min_prob)
  ## sampling
  N <- n * M
  x <- rmultinom(N, V, prob)
  keep <- which(colSums(x == 0) == 0)
  ## the over-sampling is heuristic, so check that enough samples survived
  if (length(keep) < n) stop("too few samples without 0; try again")
  ## index into 'keep' so a length-1 'keep' is not read as 1:keep by sample()
  x[, keep[sample.int(length(keep), n)], drop = FALSE]
}
Now let's try
V <- 76
prob <- c(53, 13, 9, 1)
Directly using rmultinom to draw samples can occasionally result in ones with 0:
## number of samples that contain 0 in 1000 trials
sum(colSums(rmultinom(1000, V, prob) == 0) > 0)
#[1] 355 ## or some other value greater than 0
But there is no such issue by using positive_rmultinom:
## number of samples that contain 0 in 1000 trials
sum(colSums(positive_rmultinom(1000, V, prob) == 0) > 0)
#[1] 0
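If the exact truncated multinomial distribution is not required, a different and simpler technique avoids rejection altogether: give every group one unit up front and distribute the remaining V - G units with rmultinom. Note that this is not the same distribution as the one produced by positive_rmultinom, and shifted_rmultinom is a name made up for this sketch.

```r
# hypothetical helper: positive integers summing to V, obtained by
# reserving 1 per group and spreading the remaining V - G units
shifted_rmultinom <- function(n, V, prob) {
  G <- length(prob)
  if (G > V) stop("need V >= length(prob) for all-positive samples")
  1L + rmultinom(n, V - G, prob)   # rmultinom normalizes prob internally
}

set.seed(1)
x <- shifted_rmultinom(1000, 76, c(53, 13, 9, 1))
sum(colSums(x == 0) > 0)   # 0 by construction: every entry is at least 1
range(colSums(x))          # both ends equal V = 76
```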
There is probably a less expensive way, but this seems to work.
G <- 3
V <- 21
m <- data.frame(matrix(rep(1:V,G),V,G))
tmp <- expand.grid(m) # all possibilities
out <- tmp[which(rowSums(tmp) == V),] # pluck those that sum to 'V'
out[sample(1:nrow(out),1),] # randomly select a row
Not sure how to do this with runif.
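Enumerating all V^G combinations gets expensive quickly. The same outcome (one solution picked uniformly among all positive-integer solutions) can be sketched with the "stars and bars" idea: choose G - 1 distinct cut points among the V - 1 gaps between V units and take the segment lengths. The function name random_composition is an assumption made for this example.

```r
# hypothetical helper: one uniformly chosen vector of G positive
# integers that sum to V, without enumerating all candidates
random_composition <- function(V, G) {
  if (G > V) stop("need G <= V for positive parts")
  cuts <- sort(sample.int(V - 1L, G - 1L))   # G - 1 distinct cut points
  diff(c(0L, cuts, V))                       # lengths of the G segments
}

set.seed(2)
random_composition(21, 3)
```

Each set of cut points corresponds to exactly one solution, so sampling the cut points without replacement selects every solution with equal probability.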
I figured out what I believe to be a much simpler solution. You first generate random integers from your minimum to maximum range, count them up and then make a vector of the counts (including zeros).
Note that this solution may include zeros even if the minimum value is greater than zero.
Hope this helps future R users with this problem :)
rand.vect.with.total <- function(min, max, total) {
  # generate random numbers
  x <- sample(min:max, total, replace=TRUE)
  # count how often each value in min:max was drawn; a factor with fixed
  # levels keeps the zero counts and also works when min is not 1
  out <- as.vector(table(factor(x, levels = min:max)))
  return(out)
}
rand.vect.with.total(0, 3, 5)
# [1] 3 1 1 0
rand.vect.with.total(1, 5, 10)
#[1] 4 1 3 0 2
Note, I also posted this here Generate N random integers that sum to M in R, but this answer is relevant to both questions.
y <- rnorm(5)
X <- matrix(rnorm(15),5)
b <- rep(0, 3)
e <- y - X%*%b
w <- rep(0, 3)
A <- c(1,2,1)
for(i in 1:10){
for(j in 1:3){
e <- e + X[,j]*b[j]
xe <- sum(X[,j]*e)
w[j] <- xe
b[j] <- xe - (A%*%w - 1.5)
e <- e - X[,j]*b[j]
}
}
This is dummy code that generates the vector b. The problem is that one of the vectors in the loop, w, depends on all the terms in b:
This w vector affects b as a whole vector, not term by term. But the terms in w also depend on the b[j]'s.
In the inner loop w vector is
1.78787 0.00000 0.00000 # first iteration
1.787870 -1.231099 0.000000 # second iteration
1.787870 -1.231099 7.507026 # Third iteration
respectively, for the first iteration over i.
According to theory, w should have the form it takes in the third iteration, not the forms with zeros. As it stands, for b[1], w has three zeros, for b[2], w has two zeros, and so on, but I need to use the same w for all b[j]'s.
I hope I explained it properly. Maybe this is simple, but I'm confused.
What is the best way to code this situation, where the iterations to estimate b depend on all the terms in b?
A summary of the math behind this is as follows. The b[j]'s are iterated until convergence.
We have a big for loop in R that simulates various data. For some iterations, the data are generated in such a way that a quantity inside the loop comes out as 0, which is not desirable, and we should skip that step of data generation. At the same time, we need to increase the number of iterations by one for each such skip; otherwise we will have fewer observations than required.
For example, when running the following code, we get z=0 in iterations 1, 8 and 9.
rm(list=ls())
n <- 10
z <- NULL
for(i in 1:n){
set.seed(i)
a <- rbinom(1,1,0.5)
b <- rbinom(1,1,0.5)
z[i] <- a+b
}
z
[1] 0 1 1 1 1 2 1 0 0 1
We desire to skip these steps so that we do not have any z=0 but we also want a vector z of length 10. It may be done in many ways. But what I particularly want to see is how we can stop the iteration and skip the current step when z=0 is encountered and go to the next step, ultimately obtaining 10 observations for z.
Normally we do this via a while loop, as the number of iterations required is unknown beforehand.
n <- 10L
z <- integer(n)
m <- 1L; i <- 0L
while (m <= n) {
set.seed(i)
z_i <- sum(rbinom(2L, 1, 0.5))
if (z_i > 0L) {z[m] <- z_i; m <- m + 1L}
i <- i + 1L
}
Output:
z
# [1] 1 1 1 1 1 2 1 1 1 1
i
# [1] 14
So we sample 14 times; 4 of the samples are 0 and the remaining 10 are retained.
More efficient vectorized method
set.seed(0)
n <- 10L
z <- rbinom(n, 1, 0.5) + rbinom(n, 1, 0.5)
m <- length(z <- z[z > 0L]) ## filtered samples
p <- m / n ## estimated success probability
k <- round(1.5 * (n - m) / p) ## further number of samples to ensure successful (n - m) non-zero samples
z_more <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
z <- c(z, z_more[which(z_more > 0)[seq_len(n - m)]])
Some probability theory of the geometric distribution has been used here. Initially we draw n samples, m of which are retained. So the estimated probability of accepting a sample is p <- m/n. According to the theory of the geometric distribution, we need 1/p samples on average to observe one success. Therefore, we should draw about (n-m)/p more samples to expect (n-m) successes. The 1.5 is just an inflation factor: by drawing 1.5 times as many samples we can hopefully ensure (n-m) successes.
According to the law of large numbers, the estimate of p is more precise when n is large. Therefore, this approach is stable for large n.
If you feel that 1.5 is not large enough, use 2 or 3. But my feeling is that it is sufficient.
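Since the inflation factor only makes a shortfall unlikely rather than impossible, one way to be safe is to wrap the over-sampling in a loop that tops up until n non-zero values have been collected. This is a sketch, with sample_nonzero and inflate as made-up names; it uses the exact acceptance probability 0.75 (the chance that two fair Bernoulli draws are not both 0) instead of an estimate.

```r
# hypothetical helper: n non-zero values of a + b, a and b ~ Bernoulli(0.5)
sample_nonzero <- function(n, inflate = 1.5) {
  z <- integer(0)
  while (length(z) < n) {
    need <- n - length(z)
    # P(a + b > 0) = 1 - 0.5 * 0.5 = 0.75, so need / 0.75 draws are expected
    k <- ceiling(inflate * need / 0.75)
    cand <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
    z <- c(z, cand[cand > 0L])   # keep only the accepted draws
  }
  z[seq_len(n)]                  # drop any surplus
}

set.seed(0)
z <- sample_nonzero(10L)
z
```

The loop body usually runs once; it only repeats in the rare case where the first batch falls short.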
I am trying to generate n random numbers whose sum is less than 1.
So I can't just run runif(3). But I can condition each iteration on the sum of all values generated up to that point.
The idea is to start with an empty vector, v, and set up a loop such that for each iteration, i, a runif() is generated, but before it is accepted as an element of v, i.e. v[i] <- runif(), the test sum(v) > 1 is carried out. If FALSE, the entry v[i] is accepted, BUT if TRUE, that is, the sum is greater than 1, v[i] is tossed out of the vector and iteration i is repeated.
I am far from implementing this idea, but I would like to resolve it along the lines of something similar to what follows. It's not so much a practical problem, but more of an exercise to understand the syntax of loops in general:
n <- 4
v <- 0
for (i in 1:n){
rdom <- runif(1)
if((sum(v) + rdom) < 1) v[i] <- rdom
# otherwise: keep trying before moving on to iteration i + 1???? i <- stays i?????
}
I have looked into while (actually I incorporated the while function in the title); however, I need the vector to have n elements, so I get stuck if I try something that basically tells R to add random uniform realizations as elements of the vector v while sum(v) < 1, because I can end up with less than n elements in v.
Here's a possible solution. An earlier version used the more generic repeat; I edited it to use while, which saves a couple of lines.
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
while (i < n) {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
}
v
# [1] 0.89669720 0.06178627 0.01339033 0.02333120
Using a repeat block, you must check the stopping condition explicitly anyway, but with v preallocated (so the vector no longer grows) it looks very similar:
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
repeat {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
if (i == n) break
}
If you really want to keep exactly the same procedure that you have posted (aka iteratively sample the n values one at a time from the standard uniform distribution, rejecting any samples that cause your sum to exceed 1), then the following code is mathematically equivalent, shorter, and more efficient:
samp <- function(n) {
v <- rep(0, n)
for (i in 1:n) {
v[i] <- runif(1, 0, 1-sum(v))
}
v
}
Basically, this code uses the mathematical fact that if the sum of the vector is currently sum(v), then sampling from the standard uniform distribution until you get a value no greater than 1-sum(v) is exactly equivalent to sampling in the uniform distribution from 0 to 1-sum(v). The advantage of using the latter approach is that it's much more efficient -- we don't need to keep rejecting samples and trying again, and can instead just sample once for each element.
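The equivalence can be checked empirically. The sketch below fixes an arbitrary running sum s = 0.7 (an assumption for illustration), keeps the standard uniform draws that would have been accepted, and compares them with direct draws from the uniform distribution on 0 to 1 - s.

```r
set.seed(1)
s <- 0.7   # pretend sum(v) is currently 0.7

# rejection: keep only standard uniform draws that fit under the limit
u <- runif(1e5)
accepted <- u[u <= 1 - s]

# direct: sample straight from the uniform distribution on (0, 1 - s)
direct <- runif(length(accepted), 0, 1 - s)

# both means should be close to (1 - s) / 2 = 0.15
c(rejection = mean(accepted), direct = mean(direct))

# share of draws the rejection approach throws away, roughly s
1 - length(accepted) / length(u)
```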
To get a sense of the runtime differences, consider sampling 100 observations with n=10, comparing to a working implementation of the code from your post (copied from my other answer to this question):
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
set.seed(144)
system.time(samples.OP <- replicate(100, OP(10)))
# user system elapsed
# 261.937 1.641 265.805
system.time(samples.josliber <- replicate(100, samp(10)))
# user system elapsed
# 0.004 0.001 0.004
In this case, the new approach is approaching 100,000 times faster.
It sounds like you're trying to uniformly sample from a space of n variables where the following constraints hold:
x_1 + x_2 + ... + x_n <= 1
x_1 >= 0
x_2 >= 0
...
x_n >= 0
The "hit and run" algorithm is the mathematical machinery that enables you to do exactly this. In 2-dimensional space, the algorithm will sample uniformly from the triangle with vertices (0, 0), (1, 0) and (0, 1), with each location in that triangle being equally likely to be selected.
The algorithm is provided in R through the hitandrun package, which requires you to specify the linear inequalities that define the space through a constraint matrix, direction vector, and right-hand side vector:
library(hitandrun)
n <- 3
constr <- list(constr = rbind(rep(1, n), -diag(n)),
dir = c(rep("<=", n+1)),
rhs = c(1, rep(0, n)))
set.seed(144)
samples <- hitandrun(constr, n.samples=1000)
head(samples, 10)
# [,1] [,2] [,3]
# [1,] 0.28914690 0.01620488 0.42663224
# [2,] 0.65489979 0.28455231 0.00199671
# [3,] 0.23215115 0.00661661 0.63597912
# [4,] 0.29644234 0.06398131 0.60707269
# [5,] 0.58335047 0.13891392 0.06151205
# [6,] 0.09442808 0.30287832 0.55118290
# [7,] 0.51462261 0.44094683 0.02641638
# [8,] 0.38847794 0.15501252 0.31572793
# [9,] 0.52155055 0.09921046 0.13304728
# [10,] 0.70503030 0.03770875 0.14299089
Breaking down this code a bit, we generated the following constraint matrix:
constr
# $constr
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] -1 0 0
# [3,] 0 -1 0
# [4,] 0 0 -1
#
# $dir
# [1] "<=" "<=" "<=" "<="
#
# $rhs
# [1] 1 0 0 0
Reading across the first line of constr$constr we have 1, 1, 1 which indicates "1*x1 + 1*x2 + 1*x3". The first element of constr$dir is <=, and the first element of constr$rhs is 1; putting it together we have x1 + x2 + x3 <= 1. From the second row of constr$constr we read -1, 0, 0 which indicates "-1*x1 + 0*x2 + 0*x3". The second element of constr$dir is <= and the second element of constr$rhs is 0; putting it together we have -x1 <= 0 which is the same as saying x1 >= 0. The similar non-negativity constraints follow in the remaining rows.
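To make the reading of the constraint list concrete, the rows can be applied to a candidate point with a matrix product: the point is inside the region when every row satisfies its inequality. This sketch uses base R only, and is_feasible is a helper name invented here, not part of the hitandrun package.

```r
n <- 3
constr <- list(constr = rbind(rep(1, n), -diag(n)),
               dir = rep("<=", n + 1),
               rhs = c(1, rep(0, n)))

# hypothetical helper: TRUE if x satisfies every "<=" row of the system
is_feasible <- function(x, constr) {
  all(constr$constr %*% x <= constr$rhs)
}

is_feasible(c(0.2, 0.3, 0.4), constr)    # inside: sum is 0.9, all non-negative
is_feasible(c(0.6, 0.5, 0.1), constr)    # outside: sum exceeds 1
is_feasible(c(-0.1, 0.5, 0.1), constr)   # outside: first coordinate is negative
```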
Note that the hit and run algorithm has the nice property of having the exact same distribution for each of the variables:
hist(samples[,1])
hist(samples[,2])
hist(samples[,3])
Meanwhile, the distribution of the samples from your procedure will be highly uneven, and as n increases this problem will get worse and worse.
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
samples.OP <- t(replicate(1000, OP(3)))
hist(samples.OP[,1])
hist(samples.OP[,2])
hist(samples.OP[,3])
An added advantage is that the hit-and-run algorithm appears faster -- I generated these 1000 replicates in 0.006 seconds on my computer with hit-and-run and it took 0.3 seconds using the modified code from the OP.
Here's how I would do it, without any loop, if or while:
set.seed(123)
x <- runif(1) # start with the sum that you want to obtain
n <- 4 # number of generated random numbers, can be chosen arbitrarily
y <- sort(runif(n-1,0,x)) # choose n-1 random points to cut the range [0:x]
z <- c(y[1],diff(y),x-y[n-1]) # result: determine the length of the segments
#> z
#[1] 0.11761257 0.10908627 0.02723712 0.03364156
#> sum(z)
#[1] 0.2875775
#> all.equal(sum(z),x)
#[1] TRUE
The advantage here is that you can determine exactly which sum you want to obtain and how many numbers n you want to generate for this. If you set, e.g., x <- 1 in the second line, the n random numbers stored in the vector z will add up to one.
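A closely related variant targets the original question (n numbers whose sum is less than 1) directly: cut the interval from 0 to 1 at n random points and keep the first n segment lengths, discarding the leftover piece on the right. The spacings of sorted uniform draws are uniformly distributed over all non-negative vectors with sum at most 1, so no rejection is needed. The name sum_below_one is made up for this sketch.

```r
# hypothetical helper: n non-negative numbers with sum below 1
sum_below_one <- function(n) {
  cuts <- sort(runif(n))   # n random cut points in (0, 1)
  diff(c(0, cuts))         # first n segment lengths; the last piece is dropped
}

set.seed(123)
z <- sum_below_one(4)
z
sum(z)   # equals the largest cut point, hence below 1
```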