I want to sample from a big pool of numbers e.g. -2000:5000.
I would like, though, to set the weight for a certain number to 80%.
Every other number should have an equal share of the remaining probability.
Doing this for a small sample is easy:
sample(-2:2, 10, replace = TRUE, prob=c(0.05, 0.05, 0.80, 0.05, 0.05))
this would output:
[1] 0 0 0 0 0 -1 0 0 0 0
How can I do this for a big range of numbers?
Just manipulate your probability vector programmatically. So you have
values <- -2:2
special_value <- 0
Then you can do
probs <- rep(1, length(values))
probs[values == special_value] <- (length(values) - 1) * .8 / (1 - .8)
Then use
sample(values, 10, replace = TRUE, prob=probs)
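The same idea scales directly to the full range from the question; here is a minimal sketch (assuming, for illustration, that the favoured value is 0 and the target weight is 80%):
values <- -2000:5000
special_value <- 0 # placeholder: whichever number you want to favour
probs <- rep(1, length(values))
# weight w chosen so that w / (w + length(values) - 1) = 0.8
probs[values == special_value] <- (length(values) - 1) * 0.8 / (1 - 0.8)
draws <- sample(values, 1e5, replace = TRUE, prob = probs)
mean(draws == special_value) # should be close to 0.8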
You only need to build a separate vector with all the probabilities. Take into account that when you create the probabilities vector, you can use a logical test to assign p = 0.8 to the desired number and split the remaining 0.2 evenly across all other values. The code runs as follows:
n <- 37 # your number
x <- -100:1000 # your sequence
probs <- ifelse(x == n, 0.8, (1 - 0.8) / (length(x) - 1)) # probabilities, taking into account that they have to add up to 1
sum(probs)
sample(x = x, size = 1000, replace = TRUE, prob = probs)
Let me know if it helps.
I have the following vector in R: c(0,1).
I would like to randomly sample 10 elements at a time from this vector, but such that no value repeats more than twice in a row.
The code I have tried is sample(c(0,1),10,replace=T)
But I would like to get outputs like
sample(c(0,1), 10, replace = TRUE) = (0,1,1,0,1,1,0,0,1,0)
sample(c(0,1), 10, replace = TRUE) = (0,1,0,1,0,0,1,0,1,0)
but not
sample(c(0,1), 10, replace = TRUE) = (1,0,0,0,1,1,0,0,0)
(where the same value appears three times in a row), and so on.
How could I accomplish this?
Since the number of repeats can only be 1 or 2, and since the value needs to alternate, you can achieve this in a one-liner by randomly choosing 1 or 2 repeats of each of a sequence of 1s and 0s, and truncating the result to 10 elements.
rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10]
#> [1] 0 0 1 1 0 1 1 0 1 0
If you wish to remove the constraint of the sequence always starting with a zero, you can randomly subtract the result from 1:
abs(sample(0:1, 1) - rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10])
#> [1] 1 1 0 0 1 0 0 1 1 0
foo <- function() {
  innerfunc <- function() sample(c(0, 1), 10, TRUE)
  x <- innerfunc()
  while (max(rle(x)$lengths) > 2) {
    x <- innerfunc()
  }
  x
}
foo()
This function looks at the maximum length of any run of zeroes or ones. If this is > 2, it reruns your sample call, named innerfunc here.
I think this is an interesting coding exercise if you would like to use recursion, and below is an option that might give some hints:
f <- function(n) {
  if (n <= 2) {
    return(sample(c(0, 1), n, replace = TRUE))
  }
  m <- sample(c(1, 2), 1)
  v <- Recall(n - m)
  c(v, rep((tail(v, 1) + 1) %% 2, m))
}
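A quick usage sketch to check the recursive version (output varies from run to run):
x <- f(10)
x
max(rle(x)$lengths) <= 2 # should always be TRUE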
I want to generate a vector of a given length, e.g., n = 5. Each value in the vector should be a proportion (i.e., a value between 0 and 1) so that across n elements they sum up to 1.
Unfortunately, I have two vectors: one (mymins) defines the allowed lower boundaries of each proportion and the other (mymaxs) defines the allowed top boundaries of each proportion.
In my example below the desired proportion for the first element is allowed to fall anywhere between 0.3 and 0.9. And for the last element, the desired proportion is allowed to fall between 0.05 and 0.7.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
Let's assume that mymins are always 'legitimate' (i.e., their sum is never larger than 1).
How could I find a set of 5 proportions such that they all sum to 1 but lie within the boundaries?
Here is what I tried:
n = 5
mydif <- mymaxs - mymins # possible range for each proportion
myorder <- rank(mydif) # order those differences from smallest to largest
mytarget <- sum(mydif) # sum up the 5 ranges
x <- sort(runif(n))[myorder] # generate 5 random values and sort them in the order of mydif
x2 <- mymins + x / sum(x) * mytarget # rescale random values to sum up to mytarget and add them to mymins
x3 <- x2/sum(x2) # rescale x2 to sum up to 1
As you can see, I am not very far - because after rescaling some values are outside of their allowed boundaries.
I should probably also mention that I need this operation to be fast - because I am using it in an optimization loop.
I also tried to find a solution using optim; however, the problem is that it always finds the same solution - and I need to generate a DIFFERENT solution every time I compute the proportions:
myfun <- function(x) {
  x <- round(x, 4)
  abovemins <- x - mymins
  n_belowmins <- sum(abovemins < 0)
  if (n_belowmins > 0) return(100000)
  belowmax <- x - mymaxs
  n_abovemax <- sum(belowmax > 0)
  if (n_abovemax > 0) return(100000)
  mydist <- abs(sum(x) - 1)
  return(mydist)
}
myopt <- optim(par = mymins + 0.01, fn = myfun)
myopt$par
sum(round(myopt$par, 4))
Thank you very much for your suggestions!
Perhaps it's better to think of this in a different way. Your samples actually need to sum to 0.35 (which is 1 - sum(mymins)), and then be added on to the minimum values.
constrained_sample <- function(mymins, mymaxs) {
  sizes <- mymaxs - mymins
  samp <- runif(length(mymins)) * sizes
  samp / sum(samp) * (1 - sum(mymins)) + mymins
}
It works like this:
constrained_sample(mymins, mymaxs)
#> [1] 0.31728333 0.17839397 0.07196067 0.29146744 0.14089459
We can test this works by running the following loop, which will print a message to the console if any of the criteria aren't met:
for (i in 1:1000) {
  test <- constrained_sample(mymins, mymaxs)
  if (!all(test > mymins) | !all(test < mymaxs) | abs(sum(test) - 1) > 1e-6) cat("failure")
}
This prints no failures, since the criteria are always met. However, as @GregorThomas points out, the bounds aren't realistic in this case. We can see a range of solutions constrained by your conditions using a boxplot:
library(ggplot2) # for ggplot() below

samp <- constrained_sample(mymins, mymaxs)
for (i in 1:999) samp <- rbind(samp, constrained_sample(mymins, mymaxs))
df <- data.frame(val = c(samp[, 1], samp[, 2], samp[, 3], samp[, 4], samp[, 5]),
                 index = factor(rep(1:5, each = 1000)))
ggplot(df, aes(x = index, y = val)) + geom_boxplot()
Because you need 5 random numbers to sum to 1, you really only have 4 independent numbers and one dependent number.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
set.seed(42)
iter <- 1000
while (iter > 0 &&
       (
         (1 - sum(x <- runif(4, mymins[-5], mymaxs[-5]))) < mymins[5] ||
           (1 - sum(x)) > mymaxs[5]
       )
) iter <- iter - 1
if (iter < 1) {
  # failed
  stop("unable to find something within 1000 iterations")
} else {
  x <- c(x, 1 - sum(x))
}
sum(x)
# [1] 1
all(mymins <= x & x <= mymaxs)
# [1] TRUE
x
# [1] 0.37732330 0.21618036 0.07225311 0.24250359 0.09173965
The reason I use iter there is to make sure you don't take an "infinite" amount of time to find something. If your mymins and mymaxs combination makes this mathematically infeasible (as your first example was), then you don't need to spin forever. If it is mathematically improbable to find a solution in a reasonable amount of time, you need to weigh how long you want to keep trying.
One reason this takes so long is that we are iteratively pulling entropy. If you expect this to go for a long time, then it is generally better to pre-calculate as much as you think you'll need (overall) and run things as a matrix.
set.seed(42)
n <- 10000
m <- matrix(runif(prod(n, length(mymins)-1)), nrow = n)
m <- t(t(m) * (mymaxs[-5] - mymins[-5]) + mymins[-5])
remainders <- (1 - rowSums(m))
ind <- mymins[5] <= remainders & remainders <= mymaxs[5]
table(ind)
# ind
# FALSE TRUE
# 9981 19
m <- cbind(m[ind,,drop=FALSE], remainders[ind])
nrow(m)
# [1] 19
rowSums(m)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(m)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.3405821 0.1306152 0.05931363 0.2199362 0.24955282
# [2,] 0.3601376 0.1367465 0.20235704 0.2477507 0.05300821
# [3,] 0.4469526 0.1279795 0.02265618 0.2881733 0.11423845
# [4,] 0.5450527 0.1029903 0.07503371 0.2052423 0.07168103
# [5,] 0.3161519 0.1469783 0.15290720 0.3268470 0.05711557
# [6,] 0.4782448 0.1185735 0.01664063 0.2178225 0.16871845
all(
mymins[1] <= m[,1] & m[,1] <= mymaxs[1],
mymins[2] <= m[,2] & m[,2] <= mymaxs[2],
mymins[3] <= m[,3] & m[,3] <= mymaxs[3],
mymins[4] <= m[,4] & m[,4] <= mymaxs[4],
mymins[5] <= m[,5] & m[,5] <= mymaxs[5]
)
# [1] TRUE
This time it took 10000 attempts to make 19 valid combinations. It might take more or fewer attempts based on randomness, so ymmv with regards to how much you need to pre-generate.
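If you want to size the pre-generated batch, one rough approach (a sketch, assuming the acceptance rate seen in the run above is representative) is to scale by the observed rate:
target <- 100 # how many valid combinations you want
accept_rate <- mean(ind) # acceptance rate from the batch above (19/10000 here)
n_attempts <- ceiling(target / accept_rate) # roughly how many rows to pre-generate
n_attempts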
If your example bounds are realistic, we can refine them quite a bit, narrowing the range of possibilities. For the current version of the question with:
mymins = c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs = c(0.9, 1, 1, 1, 0.7)
What's the max for x[1]? Well, if x[2:5] take on minimum values, they will add up to 0.1 + 0 + 0.2 + 0.05 = 0.35, so based on the other mins only we know that max value for x[1] is 1 - 0.35 = 0.65. The 0.9 in mymaxs is way too high.
We can calculate the actual max values taking the minimum of the max values based on the minimums and the mymaxs vector:
new_max = pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_max
# [1] 0.65 0.45 0.35 0.55 0.40
We can similarly revise the min bounds, though in this case even the revised max bounds new_max are high enough that they have no impact on the minimums.
new_min = pmax(mymins, 1 - (sum(new_max) - new_max))
new_min
# [1] 0.30 0.10 0.00 0.20 0.05
With these adjustments, we should be able to see easily if any solutions are possible (all(new_min < new_max)). And then generating random numbers as in r2evans's answer should go much quicker using the new bounds.
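As a sketch of how the tightened bounds could feed back into the rejection sampler above (assuming the same mymins and mymaxs from the question):
new_max <- pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_min <- pmax(mymins, 1 - (sum(new_max) - new_max))
stopifnot(all(new_min < new_max)) # no solution exists if this fails
# draw the first four values inside the tightened bounds,
# take the fifth as the remainder, and retry if it falls outside its bounds
repeat {
  x <- runif(4, new_min[-5], new_max[-5])
  x5 <- 1 - sum(x)
  if (new_min[5] <= x5 && x5 <= new_max[5]) break
}
x <- c(x, x5)
sum(x) # 1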
I have a matrix composed of sites and species. Some species have a certain trait value but not all of them.
I want to keep only the site-species matrix rows that contain enough trait information, in my case more than 60%.
So far, I have the following for-loop but I would like to have a faster version of this code. How can I optimize this and skip the for-loop part?
# site-species matrix
A <- matrix(c(0, 0.2, 0.2, 0.6, 0.3, 0.3, 0, 0.4), byrow = T, nrow = 2)
colnames(A) <- paste0("sp_", seq(ncol(A)))
rownames(A) <- paste0("site_", seq(nrow(A)))
# trait information
B <- data.frame(sp = paste0("sp_", seq_len(ncol(A))),
                value = c(NA, NA, 2, 3))
# For-loop to get the coverage percentage for each row
pcover <- c()
for (i in 1:nrow(A)) {
  non_null_A <- A[i, ][A[i, ] > 0]
  B_match <- match(names(non_null_A), B[, "sp"])
  B_value <- B[B_match, "value"]
  pcover <- rbind(pcover,
                  sum(!is.na(B_value)) / length(B_value) * 100)
}
A
A[pcover > 60, , drop = FALSE] # in this case, the second site is removed
The idea is that you have two conditions working together:
is A positive?
is B$value NA?
We compute these tests from the start and use only vectorized code:
Apos <- A[, B$sp] > 0 # or just A > 0 here, but I assumed from your code that you needed this
pcover <- 100 * colSums(t(Apos) & !is.na(B$value)) / rowSums(Apos)
pcover
# site_1 site_2
# 66.66667 33.33333
A[pcover > 60, , drop = FALSE]
# sp_1 sp_2 sp_3 sp_4
# site_1 0 0.2 0.2 0.6
I am trying to draw from two different distributions, with a given probability, 100,000 times. Unfortunately I can't see what is wrong with my for loop; it only adds 1 value to simulated_data instead of the desired 100,000 values.
Question 1: How can I fix this?
Question 2: Is there a far more efficient method where I don't have to loop through 100,000 items in a list?
#creating a vector of probabilities
probabilities <- rep(0.99,100000)
#creating a vector of booleans
logicals <- runif(length(probabilities)) < probabilities
#empty list for my simulated data
simulated_data <- c()
#drawing from two different distributions depending on the value in logicals
for (i in logicals) {
  if (isTRUE(i)) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  } else {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
  }
}
It seems that you want to create a final sample where each element is taken randomly from either sample1 or sample2, with probabilities 0.99 and 0.01.
The correct approach is to generate both samples, each containing the same number of elements, and then select randomly from one or the other:
# Generate both samples
n = 100000
sample1 = rnorm(n,0,1)
sample2 = rnorm(n,0,10)
# Create the logical vector that will decide whether to take from sample 1 or 2
s1_s2 = runif(n) < 0.99
# Create the final sample
sample = ifelse(s1_s2, sample1, sample2)
In this case, it is not guaranteed that there are exactly 0.99*n samples from sample1 and 0.01*n from sample2. In fact:
> sum(sample == sample1)
[1] 98953
This is close to 0.99*n, as expected, but not exactly.
Create a vector with the desired fraction of values from each distribution and then create a random permutation of the values:
N = 10000
frac = 0.99
rand_mix = sample(c(rnorm(frac * N, 0, sd = 1), rnorm((1 - frac) * N, 0, sd = 10)))
> table(abs(rand_mix) > 1.96)
FALSE  TRUE
 9364   636
> (10000 - 636) / 10000
[1] 0.9364
> table( rnorm(10000) >6)
FALSE
10000
The fraction is fixed. If you want a possibly random fraction (but close to 0.99 statistically) then try this:
> table( sample( c( rnorm(10e6), rnorm(10e4, sd=10) ), 10e4) > 1.96 )
FALSE TRUE
97151 2849
Compare with:
> N = 100000
> frac =0.99
> rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )
> table( rand_mix > 1.96 )
FALSE TRUE
97117 2883
Here is a nice solution for anyone here:
n <- 100000
prob1 <- 0.99
prob2 <- 1-prob1
dist1 <- rnorm(prob1*n, 0, 1)
dist2 <- rnorm(prob2*n, 0, 10)
actual_sample <- c(dist1, dist2)
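# note: c(dist1, dist2) keeps the draws grouped by distribution;
# shuffle with sample(actual_sample) if the order of elements matters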
I have a vector A which contains zeros and ones. I would like to randomly change n percent of the ones to zero. Is this the best way to do it in R (10% change):
for (i in 1:length(A)) {
  if (A[i] > 0) {
    if (runif(1) <= 0.1) {
      A[i] = 0
    }
  }
}
Thanks.
You can do this without using the for loops and if statements:
##Generate some data
R> A = sample(0:1, 100, replace=TRUE)
##Generate n U(0,1) random numbers
##If any of the U's are less than 0.1
##Set the corresponding value in A to 0
R> A[runif(length(A)) < 0.1] = 0
The other point to note is that you don't have to do anything special for values of A that equal 0, as the probability of changing a 1 to a 0 is still 0.1.
As Hadley points out, your code doesn't randomly change 10% of 1's to 0. If that is really your intention, then:
##Select the rows in A equal to 1
R> rows_with_1 = (1:length(A))[A==1]
##Randomly select a % of these rows and set equal to zero
##Warning: there will likely be some rounding here
R> A[sample(rows_with_1, length(rows_with_1)*0.1)] = 0
If this is your A:
A <- round(rnorm(100, 0.5, 0.1))
This should do it:
n <- 10
A[sample(which(A == 1), length(which(A == 1)) * n / 100)] <- 0
where n is the percentage of your 1s that you want to change to 0s.
You can vectorize that:
A <- round(runif(20), 0)
A[sample(which(A == 1), 0.1 * sum(A == 1))] <- 0
HTH