I have a vector A which contains zeros and ones. I would like to randomly change n percent of the ones to zero. Is this the best way to do it in R (10% change):
for (i in 1:length(A))
{
  if (A[i] > 0)
  {
    if (runif(1) <= 0.1)
    {
      A[i] = 0
    }
  }
}
Thanks.
You can do this without using the for loops and if statements:
##Generate some data
R> A = sample(0:1, 100, replace=TRUE)
##Generate n U(0,1) random numbers
##If any of the U's are less than 0.1
##Set the corresponding value in A to 0
R> A[runif(length(A)) < 0.1] = 0
The other point to note is that you don't have to do anything special for values of A that already equal 0: the probability of changing a 1 to a 0 is still 0.1.
As Hadley points out, your code doesn't randomly change 10% of the 1's to 0: each 1 is changed independently with probability 0.1, so the number actually changed varies from run to run. If you really want exactly 10% changed, then:
##Select the rows in A equal to 1
R> rows_with_1 = (1:length(A))[A==1]
##Randomly select a % of these rows and set equal to zero
##Warning: there will likely be some rounding here
R> A[sample(rows_with_1, length(rows_with_1)*0.1)] = 0
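The exact-percentage approach above can be wrapped in a small function. This is only a sketch: the name zero_out_fraction and its frac argument are hypothetical, and rounding the count with round() is an assumption about how fractional counts should be handled.

```r
# Hypothetical helper: set a fixed fraction of the 1s in A to 0.
zero_out_fraction <- function(A, frac = 0.1) {
  ones <- which(A == 1)              # positions of the 1s
  k <- round(length(ones) * frac)    # how many to change (assumed rounding rule)
  if (k > 0) {
    # Sample positions within 'ones' rather than sampling 'ones' itself,
    # so a single remaining index is not treated by sample() as 1:index.
    A[ones[sample.int(length(ones), k)]] <- 0
  }
  A
}

set.seed(1)
A <- rep(c(0, 1), each = 50)         # 50 zeros followed by 50 ones
B <- zero_out_fraction(A, 0.1)
sum(A) - sum(B)                      # number of 1s changed: 5
```

Unlike the runif version, this always changes the same number of 1s for a given input.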
If this is your A:
A <- round(rnorm(100, 0.5, 0.1))
This should do it:
n <- 10
A[sample(which(A == 1), sum(A == 1) * n / 100)] <- 0
where n is the percentage of your 1s that you want to change to 0s.
You can vectorize that:
A <- round(runif(20), 0)
A[sample(which(A == 1), 0.1 * sum(A == 1))] <- 0
HTH
I would like to randomly assign positive integers to G groups, such that they sum up to V.
For example, if G = 3 and V = 21, valid results may be (7, 7, 7), (10, 6, 5), etc.
Is there a straightforward way to do this?
Editor's notice (from 李哲源):
If values are not restricted to integers, the problem is simple and has been addressed in Choosing n numbers with fixed sum.
For integers, there is a previous Q & A: Generate N random integers that sum to M in R but it appears more complicated and is hard to follow. The loop based solution over there is also not satisfying.
non-negative integers
Let n be sample size:
x <- rmultinom(n, V, rep.int(1 / G, G))
is a G x n matrix, where each column is a multinomial sample that sums up to V.
By passing rep.int(1 / G, G) to argument prob I assume that each group has equal probability of "success".
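A quick check of the shape and column sums, using the question's G = 3 and V = 21. The sample size n = 5 and the seed are arbitrary choices for illustration.

```r
set.seed(42)
G <- 3
V <- 21
n <- 5
x <- rmultinom(n, V, rep.int(1 / G, G))
dim(x)       # G rows, n columns
colSums(x)   # every column sums to V
```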
positive integers
As Gregor mentions, a multinomial sample can contain 0. If such samples are undesired, they should be rejected. As a result, we sample from a truncated multinomial distribution.
In How to generate target number of samples from a distribution under a rejection criterion I suggested an "over-sampling" approach to achieve "vectorization" for truncated sampling. Simply put, knowing the acceptance probability, we can estimate the expected number of trials M needed to see the first "success" (a sample without zeros). We first draw, say, 1.25 * M samples, so that these samples are very likely to contain at least one "success". We then randomly return one as the output.
The following function implements this idea to generate truncated multinomial samples without 0.
positive_rmultinom <- function (n, V, prob) {
  ## input validation
  G <- length(prob)
  if (G > V) stop("'G > V' causes 0 in a sample for sure!")
  if (any(prob < 0)) stop("'prob' can not contain negative values!")
  ## normalization
  sum_prob <- sum(prob)
  if (sum_prob != 1) prob <- prob / sum_prob
  ## minimal probability
  min_prob <- min(prob)
  ## expected number of trials to get a "success" on the group with min_prob
  M <- round(1.25 * 1 / min_prob)
  ## sampling
  N <- n * M
  x <- rmultinom(N, V, prob)
  keep <- which(colSums(x == 0) == 0)
  x[, sample(keep, n)]
}
Now let's try
V <- 76
prob <- c(53, 13, 9, 1)
Directly using rmultinom to draw samples can occasionally result in ones with 0:
## number of samples that contain 0 in 1000 trials
sum(colSums(rmultinom(1000, V, prob) == 0) > 0)
#[1] 355 ## or some other value greater than 0
But there is no such issue by using positive_rmultinom:
## number of samples that contain 0 in 1000 trials
sum(colSums(positive_rmultinom(1000, V, prob) == 0) > 0)
#[1] 0
There is probably a less expensive way, but this seems to work. Note that it enumerates all V^G combinations, so it only suits small G and V.
G <- 3
V <- 21
m <- data.frame(matrix(rep(1:V,G),V,G))
tmp <- expand.grid(m) # all possibilities
out <- tmp[which(rowSums(tmp) == V),] # pluck those that sum to 'V'
out[sample(1:nrow(out),1),] # randomly select a row
Not sure how to do with runif
I figured out what I believe to be a much simpler solution. You first generate random integers from your minimum to maximum range, count them up and then make a vector of the counts (including zeros).
Note that this solution may include zeros even if the minimum value is greater than zero.
Hope this helps future r people with this problem :)
rand.vect.with.total <- function(min, max, total) {
  # generate random numbers
  x <- sample(min:max, total, replace = TRUE)
  # count numbers, keeping a (possibly zero) count for every value in min:max
  counts <- table(factor(x, levels = min:max))
  # return the counts as a plain vector, one entry per value in min:max
  return(as.vector(counts))
}
rand.vect.with.total(0, 3, 5)
# [1] 3 1 1 0
rand.vect.with.total(1, 5, 10)
#[1] 4 1 3 0 2
Note: I also posted this under "Generate N random integers that sum to M in R", as the answer is relevant to both questions.
I would like to generate N random positive integers that sum to M. I would like the random positive integers to be selected around a fairly normal distribution whose mean is M/N, with a small standard deviation (is it possible to set this as a constraint?).
Finally, how would you generalize the answer to generate N random positive numbers (not just integers)?
I found other relevant questions, but couldn't determine how to apply their answers to this context:
https://stats.stackexchange.com/questions/59096/generate-three-random-numbers-that-sum-to-1-in-r
Generate 3 random number that sum to 1 in R
R - random approximate normal distribution of integers with predefined total
Normalize.
rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i] <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}
For a continuous version, simply use:
rand_vect_cont <- function(N, M, sd = 1) {
  vec <- rnorm(N, M/N, sd)
  vec / sum(vec) * M
}
Examples
rand_vect(3, 50)
# [1] 17 16 17
rand_vect(10, 10, pos.only = FALSE)
# [1] 0 2 3 2 0 0 -1 2 1 1
rand_vect(10, 5, pos.only = TRUE)
# [1] 0 0 0 0 2 0 0 1 2 0
rand_vect_cont(3, 10)
# [1] 2.832636 3.722558 3.444806
rand_vect(10, -1, pos.only = FALSE)
# [1] -1 -1 1 -2 2 1 1 0 -1 -1
I just came up with an algorithm to generate N random numbers greater than or equal to k whose sum is S, in a uniformly distributed manner. I hope it will be of use here!
First, generate N-1 random numbers between k and S - k(N-1), inclusive. Sort them in descending order. Then, for all x_i with i <= N-2, apply x'_i = x_i - x_{i+1} + k, and set x'_{N-1} = x_{N-1} (use two buffers). The Nth number is just S minus the sum of all the obtained quantities. This has the advantage of giving the same probability to all possible combinations. If you want positive integers, use k = 1 (k = 0 gives non-negative integers). If you want reals, use the same method with a continuous RNG. If your numbers are to be integers, you may care about whether they can or cannot be equal to k. Best wishes!
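The steps above can be sketched in R for the continuous case. This is an illustration only: the function name rand_fixed_sum is hypothetical, runif is assumed as the continuous RNG, and the integer variant is not covered.

```r
# Hypothetical helper: N real numbers, each >= k, that sum to S.
rand_fixed_sum <- function(N, S, k = 0) {
  stopifnot(N >= 2, S >= k * N)
  # N - 1 draws between k and S - k(N - 1), sorted in descending order
  x <- sort(runif(N - 1, min = k, max = S - k * (N - 1)), decreasing = TRUE)
  # x'_i = x_i - x_{i+1} + k for i <= N - 2 (empty when N == 2)
  gaps <- x[-(N - 1)] - x[-1] + k
  # x'_{N-1} = x_{N-1}
  out <- c(gaps, x[N - 1])
  # the Nth number is S minus the sum of the others
  c(out, S - sum(out))
}

set.seed(7)
v <- rand_fixed_sum(5, 100, k = 2)
sum(v)   # 100, up to floating-point error
min(v)   # at least 2
```

The new vector is built from the sorted draws without modifying them in place, which plays the role of the "two buffers" mentioned above.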
Explanation: by taking out one of the numbers, all the combinations of values which allow a valid Nth number form a simplex when represented in (N-1)-space, which lies at one vertex of an (N-1)-cube (the (N-1)-cube described by the range of the random values). After generating them, we have to map all points in the (N-1)-cube to points in the simplex. For that purpose, I have used one method of triangulation which involves all possible permutations of coordinates in descending order. By sorting the values, we are mapping all (N-1)! simplices to only one of them. We also have to translate and scale the vector of numbers so that all coordinates lie in [0, 1], by subtracting k and dividing the result by S - kN. Let us name the new coordinates y_i.
Then we apply the transformation by multiplying by the inverse matrix of the original basis, something like this:

$$
B = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
B^{-1} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
Y' = B^{-1} Y
$$

Which gives $y'_i = y_i - y_{i+1}$. When we rescale the coordinates, we get:

$$
x'_i = y'_i (S - kN) + k = y_i (S - kN) - y_{i+1} (S - kN) + k = (x_i - k) - (x_{i+1} - k) + k = x_i - x_{i+1} + k,
$$

hence the above formula. This is applied to all elements except the last one.
Finally, we should take into account the distortion that this transformation introduces into the probability distribution. Actually, and please correct me if I'm wrong, the transformation applied to the first simplex to obtain the second should not alter the probability distribution. Here is the proof.
The probability increase at any point is the increase in the volume of a local region around that point as the size of the region tends to zero, divided by the total volume increase of the simplex. In this case, the two volumes are the same (just take the determinants of the basis vectors). The probability distribution will be the same if the linear increase of the region volume is always equal to 1. We can calculate it as the determinant of the transpose matrix of the derivative of a transformed vector $V' = B^{-1} V$ with respect to $V$, which, of course, is $B^{-1}$.
Calculation of this determinant is quite straightforward, and it gives 1, which means that the points are not distorted in any way that would make some of them more likely to appear than others.
a) Create a vector X of length 20, with the kth element in X = 2k, for k=1…20. Print out the values of X.
b) Create a vector Y of length 20, with all elements in Y equal to 0. Print out the values of Y.
c) Using a for loop, reassign the value of the kth element in Y, for k = 1…20. When k < 12, the kth element of Y is reassigned as the cosine of k. When k ≥ 12, the kth element of Y is reassigned as the value of the integral of sqrt(t) dt from 0 to k.
For the first two questions, it is simple.
> x1 <- seq(1,20,by=1)
> x <- 2 * x1
> x
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
> y <- rep(0,20)
> y
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I got stuck on the last one:
t <- function(i) sqrt(i)
for (i in 1:20) {
if (i < 12) {
y[i] <- cos(i)
}
else if (i >= 12) {
y[i] <- integral(t, lower= 0, Upper = 20)
}
}
y // print new y
Any suggestions? thanks.
What may help is that the command to calculate a one-dimensional integral is integrate not integral.
You have successfully completed the first two, so I'll demonstrate a different way of getting those vectors:
x <- 2 * seq_len(20)
y <- double(length = 20)
As for your function, you have the right idea, but you need to clean up your syntax a bit. For example, you may need to double-check your braces (using a set style like Hadley Wickham's will help you prevent syntax errors and make the code more readable), you don't need the "if" in the else, you need to read up on integrate and see what its inputs, and importantly its outputs are (and which of them you need and how to extract it), and lastly, you need to return a value from your function. Hopefully, that's enough to help you work it out on your own. Good Luck!
Update
Slightly different function to demonstrate coding style and some best practices with loops
Given that a working answer has been posted, this is what I did when looking at your question. I think it is worth posting, as I think it is a good habit to 1) pre-allocate answers, 2) prevent confusion about scope by not re-using the input variable name as an output, and 3) use the seq_len and seq_along constructions for for loops, per The R Inferno (PDF), which is required reading, in my opinion:
tf <- function(y){
  z <- double(length = length(y))
  for (k in seq_along(y)) {
    if (k < 12) {
      z[k] <- cos(k)
    } else {
      z[k] <- integrate(f = sqrt, lower = 0, upper = k)$value
    }
  }
  return(z)
}
Which returns:
> tf(y)
[1] 0.540302306 -0.416146837 -0.989992497 -0.653643621 0.283662185 0.960170287 0.753902254
[8] -0.145500034 -0.911130262 -0.839071529 0.004425698 27.712816032 31.248114562 34.922139530
[15] 38.729837810 42.666671456 46.728535669 50.911693960 55.212726149 59.628486093
To be honest, you almost have it ready, and it is good that you have shown some code here:
y <- rep(0,20) #y vector from question 2
for ( k in 1:20) { #start the loop
  if (k < 12) { #if k less than 12
    y[k] <- cos(k) #calculate cosine
  } else if( k >= 12) { #else if k greater or equal to 12
    y[k] <- integrate( sqrt, lower=0, upper=k)$value #see below for explanation
  }
}
print(y) #prints y
> print(y)
[1] 0.540302306 -0.416146837 -0.989992497 -0.653643621 0.283662185 0.960170287 0.753902254 -0.145500034 -0.911130262 -0.839071529 0.004425698
[12] 27.712816032 31.248114562 34.922139530 38.729837810 42.666671456 46.728535669 50.911693960 55.212726149 59.628486093
First of all, stats::integrate is the function you need to calculate the integral:
integrate( sqrt, lower=0, upper=2)$value
The first argument is a function which in your case is sqrt. sqrt is defined already in R so there is no need to define it yourself explicitly as t <- function(i) sqrt(i)
The other two arguments as you correctly set in your code are lower and upper.
The function integrate( sqrt, lower=0, upper=2) will return:
1.885618 with absolute error < 0.00022
and that is why you need integrate( sqrt, lower=0, upper=2)$value to only extract the value.
Type ?integrate in your console to see the documentation which will help you a lot I think.
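As a sanity check on the numerical results, the integral of sqrt(t) from 0 to k has the closed form (2/3) * k^(3/2). The sketch below compares it with integrate for k = 12, ..., 20; the variable names and the tolerance are arbitrary choices for illustration.

```r
k <- 12:20
# numerical integration, one call per upper limit
numeric_vals <- sapply(k, function(u) integrate(sqrt, lower = 0, upper = u)$value)
# closed form of the same integral
closed_form <- 2 / 3 * k^(3 / 2)
all.equal(numeric_vals, closed_form, tolerance = 1e-4)

# the whole vector can then be built without a loop
k_all <- 1:20
y <- ifelse(k_all < 12, cos(k_all), 2 / 3 * k_all^(3 / 2))
```

The exercise asks for a for loop, so the vectorised last step is only a cross-check, not a replacement for the loop above.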
I want to write a function such that rwabovex is the sum of the values of S that are greater than 0. (My S is a random walk simulation)
Here's what I have so far but I'm not getting the right output. Can you please help?
rwabovex=function(n){
  if (n <= 0) {
    return(cat("n must be greater than 0"))
  } else {
    S=numeric(n)
    S[1] = 0
    above = 0
    for(i in 2:n) {
      step=c(1, -1)
      S[i]=S[i-1]+sample(step, 1, prob = c(0.5, 0.5), replace = TRUE)
      if (S[i] > 0) {
        above = above + S[i]
      }
      print(above)
    }
  }
}
For example: if n=4 and the S values are -1, 2, 1, 0 then "above" should equal to 3 (since 2 and 1 are greater than 0).
Thanks!
First, you're not using vectorization to compute S, which will make the procedure slow for large n. You can vectorize using cumsum. Secondly, you can use sum to compute the sum of values in S greater than 0:
rwabovex = function(n) {
  step = c(1, -1)
  S = c(0, cumsum(sample(step, n-1, prob=c(.5, .5), replace=TRUE)))
  print(S)
  return(sum(S[S > 0]))
}
set.seed(144)
rwabovex(10)
# [1] 0 -1 0 1 2 1 2 1 2 1
# [1] 10
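One possible variation, as a sketch only: separating the simulation from the summation makes the summing step easy to check against the example in the question. The names simulate_walk and sum_above_zero are hypothetical, and using stop() for invalid n is an assumption about the desired error handling.

```r
# Hypothetical helper: sum of the strictly positive values of a walk.
sum_above_zero <- function(S) sum(S[S > 0])

# Hypothetical simulator: n positions starting at 0, with steps of +1 or -1.
simulate_walk <- function(n) {
  if (n < 1) stop("n must be greater than 0")
  c(0, cumsum(sample(c(1, -1), n - 1, replace = TRUE)))
}

sum_above_zero(c(-1, 2, 1, 0))   # 3, as in the question's example

set.seed(144)
S <- simulate_walk(10)
sum_above_zero(S)
```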