My question is basically: what is the vectorized R equivalent of the solution to the following MATLAB problem?
"Generate random number with given probability" (MATLAB)
I can generate a random event outcome from a single uniform random number and the given probability of each event (the probabilities sum to 100%, so exactly one event happens) with:
sum(runif(1,0,1) >= cumsum(wdOff))
However, this only takes a single uniform random number, whereas I want it to take a vector of uniform random numbers and output the corresponding event for each entry.
So basically I'm looking for the R equivalent of Oleg's vectorized MATLAB solution (from the comments on the MATLAB answer):
"Vectorized solution: sum(bsxfun(@ge, r, cumsum([0, prob])), 2), where r is a column vector and prob a row vector. – Oleg"
Tell me if you need more information.
You could just do a weighted random sample, without worrying about your cumsum method:
sample(c(1, 2, 3), size = 100, replace = TRUE, prob = c(0.5, 0.1, 0.4))
If you already have the numbers, you could also do:
x <- runif(10, 0, 1)
as.numeric(cut(x, breaks = c(0, 0.5, 0.6, 1)))
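If you specifically want a direct analogue of the MATLAB bsxfun one-liner, here is a minimal sketch (assuming prob sums to 1; the leading 0 in the breaks makes the event indices start at 1, matching cumsum([0, prob])):
r <- runif(10)
prob <- c(0.5, 0.1, 0.4)
rowSums(outer(r, cumsum(c(0, prob)), ">="))  # event index for each entry of r
findInterval(r, cumsum(c(0, prob)))          # same result, more idiomatic in R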
In an attempt to avoid nesting for loops 6-7 times, I am trying to use lapply to find the proportion of randomly drawn values (that are combined in a certain way) that exceed some arbitrary threshold values. The problem is that I have several parameters that each vary a certain number of ways, and these, in turn, affect how the values are combined. The goal is to use the results in an ANOVA to see how varying these parameters contributes to reaching those thresholds. However, I don't understand how to do this. I have a feeling that anonymous functions could be useful, but I don't understand how they work with more than one parameter.
I tried to simplify the code as much as possible. But again, there are just so many parameters that must be included.
trials = 10
data_means = c(0,1,2,3)
prior_samples = c(2, 8, 32)
data_SD = c(0.5, 1, 2)
thresholds = c(10, 30, 80)
The idea is that there are two distributions, data and prior, which I draw values from. I always draw one from data, but I draw a sample (see prior_samples) of values from the prior distribution. There are four different values that determine the mean of the data distribution (see data_means), but the values are drawn the same number of times (determined by trials) from each of these four "versions" of the data distribution. These are then put into nested lists:
set.seed(123)
data_list = list()
for (nMean in data_means){           # the data values
  for (nTrial in 1:trials){
    data_list[[paste(nMean, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(1, nMean, 1)
  }
}
prior_list = list()
for (nSamples in prior_samples){     # the prior values
  for (nTrial in 1:trials){
    prior_list[[paste(nSamples, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(nSamples, 0, 1)
  }
}
Then I create another list for the prior values, because I want to calculate the means and standard deviations (SD) of the samples of prior values. I include normal SD, as well as SD/2 and SD*2:
prior_SD = list("mean"=0, "standard_devations"=list("SD/2"=0, "SD"=0, "SD*2"=0))
prior_mean_SD = rep(list(prior_SD), trials)
prior_nested_list = list("2"=prior_mean_SD, "8"=prior_mean_SD, "32"=prior_mean_SD)
for (nSamples in 1:length(prior_samples)){
  for (nTrial in 1:trials){
    prior_nested_list[[nSamples]][[nTrial]][["mean"]] = mean(prior_list[[nSamples]][[nTrial]])
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD/2"]] = sd(prior_list[[nSamples]][[nTrial]])/2
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD"]]   = sd(prior_list[[nSamples]][[nTrial]])
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD*2"]] = sd(prior_list[[nSamples]][[nTrial]])*2
  }
}
Then I combine the values from the data list and the last list, using list.zip from rlist:
library(rlist)
dataMean0 = list.zip(dMean0=data_list[["0"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean1 = list.zip(dMean1=data_list[["1"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean2 = list.zip(dMean2=data_list[["2"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean3 = list.zip(dMean3=data_list[["3"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
all_values = list(mean_difference0=dataMean0, mean_difference1=dataMean1,
mean_difference2=dataMean2, mean_difference3=dataMean3)
Now comes the tricky part. I combine the data values and the prior values in all_values by using this custom function for the Kullback-Leibler divergence. As you can see, there are six parameters that vary:
mean_diff refers to the means of the data distribution (data_means). It is named mean_diff because it refers to the difference in mean between the prior distribution (which is always 0) and the data distribution (which can be 0, 1, 2 or 3).
trial refers to trials,
pSample refers to the numbers of samples drawn from the prior distribution (prior_samples)
p_SD refers to the calculations of the SD based on the prior samples (normal SD, SD/2, SD*2)
data_SD refers to the SD of the data distribution, determined by data_SD
threshold refers to thresholds
The Kullback-Leibler divergence function:
kld = function(mean_diff, trial, pSample, p_SD, data_SD, threshold){
  prior_mean = all_values[[mean_diff]][[trial]][[pSample]][["mean"]]
  data_mean  = all_values[[mean_diff]][[trial]][["mean"]]
  prior_SD   = all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
  posterior_SD = sqrt(1/(1/(prior_SD*prior_SD) + 1/(data_SD*data_SD)))
  posterior_mean = (data_SD*data_SD)/((data_SD*data_SD) + (prior_SD*prior_SD))*prior_mean +
                   (prior_SD*prior_SD)/((data_SD*data_SD) + (prior_SD*prior_SD))*data_mean
  kl = log(prior_SD/posterior_SD) +
       ((posterior_SD*posterior_SD) + (prior_mean - posterior_mean)^2)/(2*(prior_SD*prior_SD)) - 0.5 +
       log(posterior_SD/prior_SD) +
       ((prior_SD*prior_SD) + (prior_mean - posterior_mean)^2)/(2*(posterior_SD*posterior_SD)) - 0.5
  length(which(kl >= threshold))/trials
}
So the question is how can one use lapply on the list with all the values (all_values) while using all the different combinations of the six parameters that are included? The data I want to end up with is the proportions of values (percentage of trials) that exceed the thresholds in all the parameter combinations.
I can't find the info I need, so any tips would be appreciated.
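Not a full answer, just a sketch of one common pattern: enumerate every combination of the six parameters with expand.grid() and then apply the existing kld() row by row with mapply(). The column values below are illustrative and assume the names used above; whether kld() returns what you intend for each combination is a separate issue.
params <- expand.grid(
  mean_diff = seq_along(data_means),
  trial     = 1:trials,
  pSample   = c("pSample2", "pSample8", "pSample32"),
  p_SD      = c("SD/2", "SD", "SD*2"),
  data_SD   = data_SD,
  threshold = thresholds,
  stringsAsFactors = FALSE
)
params$result <- mapply(kld, params$mean_diff, params$trial, params$pSample,
                        params$p_SD, params$data_SD, params$threshold)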
I'm trying to make a variable, Var, that takes the value 0 60% of the time, and 1 otherwise, with 50,000 observations.
For a normally distributed variable, I remember doing the following to define n:
Var <- rnorm(50000, 0, 1)
Is there a way I could combine an ifelse command with the above to specify n as well as the probability of Var being 0?
I would use rbinom like this:
n_ <- 50000
p_ <- 0.4  # probability of drawing a 1
Var <- rbinom(n=n_, size=1, prob=p_)
By using variables, you can change the size and/or probability just by changing those variables. Hope that's what you are looking for.
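A quick sanity check of the proportions (just a sketch; the exact counts will vary from run to run):
mean(Var == 0)  # should be close to 0.6
table(Var)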
If by 60% you mean a probability equal to 0.6 (rather than an empirical frequency), then
Var <- sample(0:1, 50000, prob = c(6, 4), replace = TRUE)
gives the desired sequence of independent draws, each equal to 0 with probability 0.6 and 1 with probability 0.4.
I'm picking nits here, but it actually isn't completely clear exactly what you want.
Do you want to simulate a sample of 50000 from the distribution you describe?
Or, do you want 50000 replications of simulating an observation from the distribution you describe?
These are different things that, in my opinion, should be approached differently.
To simulate a sample of size 50000 from that distribution you would use:
sample(c(0,1), size = 50000, replace = TRUE, prob = c(0.6, 0.4))
To replicate 50000 simulations of sampling from the distribution you describe I would recommend:
replicate(50000, sample(c(0,1), size = 1, prob = c(0.6, 0.4)))
This might seem silly since, in this case, these two lines of code produce equivalent results.
But suppose your goal was to investigate properties of samples of size 50000? Then you would use a bunch (say, 1000) of replications of that first line of code, wrapped inside replicate:
replicate(1000, sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE))
I hope I haven't been too pedantic about this. Having seen simulations go awry it has become my belief that one should keep separate the thing being simulated from the number of simulations you decide to do. The former is fundamental to your problem, while the latter only affects the accuracy of the simulation study and how long it takes.
I need to generate random numbers with rbinom but I need to exclude 0 within the range.
How can I do it?
I would like something similar to:
k <- seq(1, 6, by = 1)
binom_pdf = dbinom(k, 322, 0.1, log = FALSE)
but I need the corresponding random dataset, because if I do the following:
binom_ran = rbinom(100, 322, 0.1)
I get values that can include 0.
Is there any way I can get around this?
Thanks
Let's suppose that we have the fixed parameters:
n: number of generated values
s: the size of the experiment
p: the probability of a success
# Example parameter values taken from the question (adjust as needed)
n <- 100   # number of generated values
s <- 322   # size of the experiment
p <- 0.1   # probability of a success

# Generate initial values
U <- rbinom(n, s, p)

# Number and location of zero values
k <- sum(U == 0)
which.k <- which(U == 0)

# While there is still a zero, generate new numbers for those positions
while (k != 0) {
  U[which.k] <- rbinom(k, s, p)
  k <- sum(U == 0)
  which.k <- which(U == 0)
  # Print how many zeroes are still there
  print(k)
}

# Print U (without zeroes)
U
In addition to the hit and miss approach, if you want to sample from the conditional distribution of a binomial given that the number of successes is at least one, you can compute the conditional distribution then directly sample from it.
It is easy to work out that if X is binomial with parameters n and p, then for x ≥ 1
P(X = x | X > 0) = P(X = x)/(1 - (1-p)^n),
since P(X > 0) = 1 - (1-p)^n.
Hence the following function will work:
rcond.binom <- function(k, n, p){
  probs <- dbinom(1:n, n, p)/(1 - (1-p)^n)  # conditional pmf of X given X > 0
  # (sample() rescales prob anyway, so the normalizing constant does not change the draws)
  sample(1:n, k, replace = TRUE, prob = probs)
}
If you are going to call the above function numerous times with the same n and p then you can just precompute the vector probs and simply use the last line of the function whenever you need it.
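For example, a minimal sketch of that precomputation, using the size and success probability from the question:
n <- 322; p <- 0.1
probs <- dbinom(1:n, n, p)/(1 - (1-p)^n)  # conditional pmf of X given X > 0
draws <- sample(1:n, 100, replace = TRUE, prob = probs)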
I haven't benchmarked it, but I suspect that the hit-and-miss approach is preferable when k is small, p not too close to 0, and n large, while for larger k, p closer to 0, and smaller n the above might be preferable.
I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, so I was thinking of a beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step, I am looking for an if statement that combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
Ideally I could combine the two into something like:
if (Loss_Y_N == 1) Loss_Amount = dbeta(...)  # ... is meant to be a random value with mean 0.15 and 0 < x <= 1
else Loss_Amount = 0
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
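If you also want the non-zero losses to have a particular mean (the question mentions 0.15): a Beta(a, b) distribution has mean a/(a + b), so Beta(10, 990) has mean 0.01. A sketch with shapes chosen for a 0.15 mean (these shape values are just one possibility):
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 3, 17)  # mean 3/(3+17) = 0.15
mean(loss_prop[loss_indicator > 0])  # should be close to 0.15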
I guess it has been asked before, but I'm still a bit rusty about "sample" and "rbinom" functions in R, and would like to ask the following two simple questions:
a) Let's say we have:
rbinom(n = 5, size = 1, prob = c(0.9,0.2,0.3))
So "n" = 5 but "prob" is only indicated for three of them. What values R assigns for these two n's?
b) Let's say we have:
sample(x = 1:3, size = 1, prob = c(.5,0.2,0.9))
According to R-help (?sample):
The optional prob argument can be used to give a vector of weights
for obtaining the elements of the vector being sampled.
They need not sum to one, but they should be non-negative and not all zero.
The question would be: why does "prob" not need to sum to one?
Any answers would be very appreciated: thank you!
From the documentation for rbinom:
The numerical arguments other than n are recycled to the length of the result.
This means that in your example the prob vector you pass in will be recycled until it reaches the required length (here, 5). So the vector which will be used is:
c(0.9, 0.2, 0.3, 0.9, 0.2)
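One way to see the recycling in action (with the same seed both calls use the same per-draw probabilities, so they return identical results):
set.seed(42)
a <- rbinom(n = 5, size = 1, prob = c(0.9, 0.2, 0.3))
set.seed(42)
b <- rbinom(n = 5, size = 1, prob = c(0.9, 0.2, 0.3, 0.9, 0.2))
identical(a, b)  # TRUE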
As for the sample function, as @thelatemail pointed out, the probabilities do not have to sum to 1. The prob vector is normalized (divided by its sum) internally.
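A quick way to see the rescaling (a sketch; the observed frequencies will only be approximately equal to the normalized weights):
x <- sample(1:3, 1e5, replace = TRUE, prob = c(0.5, 0.2, 0.9))
prop.table(table(x))  # roughly c(0.5, 0.2, 0.9)/1.6 = 0.3125, 0.1250, 0.5625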