How do I randomly generate data that is distributed according to my own density function in R?

I'm looking for a method that lets me randomly generate data according to my own pre-defined probability density function.
Is there a method that lets me do that? Or at least one that generates data according to this specific function?

Assuming that your own pre-defined pdf has $y \in \{0, 1\}$, the pdf is that of a Bernoulli distribution with parameter $\pi$.
Using that a Bernoulli random variable corresponds to a Binomial with number of trials equal to 1 ($n=1$), you can draw from the pdf using the following code:
pi <- 0.5
n <- 10 # Number of draws from specified pdf
draws <- rbinom(n, 1, pi) # Bernoulli corresponds to Binom with `size = 1`
print(draws)
# Outputs: [1] 1 0 0 1 0 0 1 1 0 1
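For a density that is not one of R's built-in distributions, one general-purpose option is rejection sampling. The sketch below is only an illustration under stated assumptions: `my_pdf` (a Beta(2, 2)-shaped density on [0, 1]), `r_custom`, and its arguments are hypothetical names, and the method as written needs a bounded support and a known upper bound on the density.

```r
# Hypothetical target density on [0, 1]: f(x) = 6 * x * (1 - x)
my_pdf <- function(x) 6 * x * (1 - x)

# Rejection sampling with a uniform proposal on [lower, upper].
# max_density must be an upper bound of pdf() on that interval (assumption).
r_custom <- function(n, pdf, lower, upper, max_density) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, lower, upper)       # candidate values
    u <- runif(n, 0, max_density)     # uniform heights under the bound
    out <- c(out, x[u <= pdf(x)])     # keep candidates that fall under the pdf
  }
  out[seq_len(n)]
}

set.seed(42)
draws <- r_custom(10000, my_pdf, lower = 0, upper = 1, max_density = 1.5)
hist(draws, freq = FALSE)
curve(my_pdf, add = TRUE, col = "red")  # the histogram should follow this curve
```

The tighter the bound `max_density`, the fewer candidates are rejected. When the CDF can be inverted in closed form, inverse transform sampling is usually simpler.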

Related

Generating data from correlation matrix: the case of bivariate distributions [duplicate]

This question already has an answer here:
Simulating correlated Bernoulli data
An apparently simple problem: I want to generate 2 (simulated) variables (x, y) from a bivariate distribution with a given correlation matrix. In other words, I want two variables/vectors with values of either 0 or 1, and a defined correlation between them.
The case of normal distribution is easy with the MASS package.
library(MASS)   # mvrnorm()
library(dplyr)  # %>% and mutate()
df_norm = mvrnorm(
  100, mu = c(x=0,y=0),
  Sigma = matrix(c(1,0.5,0.5,1), nrow = 2),
  empirical = TRUE) %>%
  as.data.frame()
cor(df_norm)
x y
x 1.0 0.5
y 0.5 1.0
Yet, how could I generate binary data from the given correlation matrix?
This does not work:
df_bin = df_norm %>%
mutate(
x = ifelse(x<0,0,1),
y = ifelse(y<0,0,1))
x y
1 0 1
2 0 1
3 1 1
4 0 1
5 1 0
6 0 0
7 1 1
8 1 1
9 0 0
10 1 0
Although this creates binary variables, the correlation is not (even close to) 0.5.
cor(df_bin)
x y
x 1.0000000 0.2994996
y 0.2994996 1.0000000
Ideally I would like to be able to specify the type of distribution as an argument in the function (as in the lm() function).
Any idea?
I guessed that you weren't looking for binary, as in values of either zero or one. If that is what you're looking for, this isn't going to help.
I think what you want to look at is the construction of binary pair-copula.
You said you wanted to specify the distribution. The package VineCopula would be a good start.
You can use the correlation matrix to simulate the data after selecting the distribution. You mentioned lm(); a Gaussian (normal) copula is one of the options.
You can read about this approach in Lin and Chaganty (2021). The package isn't based on their work, but that's where I started when I looked for your answer.
Here I use a correlation parameter of 0.5 and the Gaussian copula (family = 1) to simulate 100 pairs of points:
# vine-copula
library(VineCopula)
set.seed(246543)
df <- BiCopSim(100, 1, .5)
head(df)
# [,1] [,2]
# [1,] 0.07585682 0.38413426
# [2,] 0.44705686 0.76155029
# [3,] 0.91419758 0.56181837
# [4,] 0.65891869 0.41187594
# [5,] 0.49187672 0.20168128
# [6,] 0.05422541 0.05756005
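If 0/1 values are what is needed, one possible route is to keep thresholding a bivariate normal but raise the latent correlation first. The sketch below assumes both variables have P(1) = 0.5 (threshold at 0); in that special case the correlation of the binary pair equals (2/pi) * asin(rho), which is also why a latent correlation of 0.5 yields roughly 0.33 rather than 0.5. The variable names are illustrative, and other marginal probabilities would need a numerical search for the latent correlation.

```r
set.seed(1)
n <- 100000
target_phi <- 0.5                              # desired correlation of the 0/1 variables

# Invert phi = (2 / pi) * asin(rho); valid only for thresholds at 0 (P(1) = 0.5)
latent_rho <- sin(base::pi * target_phi / 2)   # about 0.707

# Two standard normals with correlation latent_rho (base R only)
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- as.integer(z1 > 0)
y <- as.integer(latent_rho * z1 + sqrt(1 - latent_rho^2) * z2 > 0)

binary_cor <- cor(x, y)
binary_cor   # close to 0.5 for large n
```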

How can I decide whether a series of 0s and 1s is stochastically distributed?

I have a series composed of 0s and 1s, and the 0s show up in no specific order (as far as I can tell). How can I decide whether the 0s are stochastically distributed?
Please find a toy sample for reference:
library(magrittr)
# Parentheses are needed: %>% binds more tightly than *
s1 <- (runif(10) * 10) %>% mod(10) %>% round(0) %>% `>`(5) %>% ifelse(1, 0)
s2 <- c(0,0,1,0,1,1,1,0,1,0)
The runs test is what you want:
The Wald–Wolfowitz runs test (or simply runs test), named after
statisticians Abraham Wald and Jacob Wolfowitz is a non-parametric
statistical test that checks a randomness hypothesis for a two-valued
data sequence. More precisely, it can be used to test the hypothesis
that the elements of the sequence are mutually independent.
It is implemented in the snpar package.
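To show what the test computes, here is a minimal base R sketch of the runs test using the large-sample normal approximation. `runs_test` is a hypothetical helper written for this illustration, not the snpar interface, and the approximation is rough for a series as short as 10 values.

```r
# Wald-Wolfowitz runs test for a 0/1 vector (normal approximation)
runs_test <- function(x) {
  n1 <- sum(x == 1)
  n0 <- sum(x == 0)
  n <- n0 + n1
  runs <- length(rle(x)$lengths)                # number of observed runs
  mu <- 2 * n0 * n1 / n + 1                     # expected runs under randomness
  sigma2 <- (mu - 1) * (mu - 2) / (n - 1)       # variance of the number of runs
  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, expected = mu, z = z,
       p.value = 2 * pnorm(-abs(z)))            # two-sided p-value
}

s2 <- c(0, 0, 1, 0, 1, 1, 1, 0, 1, 0)
res <- runs_test(s2)
res$runs       # 7 runs observed
res$expected   # 6 runs expected
```

A large p-value means the sequence gives no evidence against randomness; it does not prove the sequence is random.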
Are you looking for rbinom? This function simulates a Bernoulli process with a chance of success (1) equal to some probability p. Otherwise, the result is 0.
The usage of rbinom is rbinom(n, size, prob), where n is the number of random numbers to generate, size is the number of trials, and prob is the probability of getting a success. So to generate a bunch of binomial random numbers with equal probability of 1 or 0, use:
set.seed(100) # for reproducibility
rbinom(n = 10, size = 1, prob = 0.5)
[1] 0 0 1 0 0 0 1 0 1 0

How to generate n random numbers from negative binomial distribution?

I am trying to make a function in order to generate n random numbers from negative binomial distribution.
To generate it, I first made a function to generate n random variables from geometric distribution. My function for generating n random numbers from geometric distribution as follows:
rGE<-function(n,p){
I<-rep(NA,n)
for (j in 1:n){
x<-rBer(1,p)
i<-1 # number of trials
while(x==0){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested this function (rGE), for example for rGE(10,0.5), which is generating 10 random numbers from a geometric distribution with probability of success 0.5, a random result was:
[1] 2 4 2 1 1 3 4 2 3 3
In rGE function I used a function named rBer which is:
rBer<-function(n,p){
sample(0:1,n,replace = TRUE,prob=c(1-p,p))
}
Now, I want to improve my above function (rGE) in order to make a function for generating n random numbers from a negative binomial function. I made the following function:
rNB<-function(n,r,p){
I<-seq(n)
for (j in 1:n){
x<-0
x<-rBer(1,p)
i<-1 # number of trials
while(x==0 & I[j]!=r){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested it several times with rNB(3,2,0.1), which should generate 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, my function (rNB) does not work correctly, because it always generates 1 for the second random number.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters r and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]
Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution with success probability p, the expected number of Bernoulli trials needed to get r successes is r / p (and the expected number of failures is r * (1 - p) / p).
If we want to draw n negative binomial samples, the expected total number of Bernoulli trials is therefore n * r / p, so we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we can redraw with twice as many, repeatedly, until we have enough; once the sum of the resulting vector of Bernoulli trials is at least r * n, we know we have enough Bernoulli trials to simulate our n negative binomial draws.
We can now run a cumsum on the vector of Bernoulli trials and subtract the trial itself, which counts the successes seen before each trial. If you then perform integer division on this vector with %/% r, every Bernoulli trial is labelled according to which negative binomial draw it belongs to, with each draw ending on its r-th success. You then table this vector.
The first n counts of the table (obtained by subsetting the table with [1:n] or, equivalently, [seq(n)]) are the numbers of trials used by each negative binomial draw. We remove the table's names with as.numeric, and subtract the number of successes (i.e. r) from each count, since we are only counting the failures, not the successes.
rNB <- function(n, r, p) {
  mult <- 2
  all_samples <- 0
  # Redraw with a larger sample until there are at least n * r successes
  while (sum(all_samples) < n * r) {
    all_samples <- rBer(ceiling(mult * n * r / p), p)
    mult <- mult * 2
  }
  # Successes seen before each trial, so each group ends on its r-th success
  prior_successes <- cumsum(all_samples) - all_samples
  as.numeric(table(prior_successes %/% r))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
Note that the logic of your own version isn't quite right. In particular, the line while(x == 0 & I[j] != r) doesn't make any sense. I is a vector of 1:n, so in your example, whenever j is 2, I[j] is equal to r and the loop stops. This is why your second number is always 1. I don't know what you were trying to do here.
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
# Create an empty vector of length n for our results
draws <- numeric(n)
# Now for each of the n trials we will get a negative binomial sample:
for (i in 1:n) {
# Create success and failure counters for this draw
failures <- successes <- 0
# Now run Bernoulli trials, counting successes and failures as we go
# until we hit r successes
while(successes < r)
{
if(rBer(1, p) == 1)
successes <- successes + 1
else
failures <- failures + 1
}
# Once we have reached r successes, the current number of failures is our
# negative binomial draw
draws[i] <- failures
}
return(draws)
}
This produces draws from the same distribution as the faster, albeit more opaque, vectorized version.
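Since the question starts from a geometric generator, it may help to see that a negative binomial draw (failures before the r-th success) is the sum of r independent geometric draws (failures before one success). The sketch below uses base R's rgeom instead of the hand-written rGE; `rNB_geom` is a hypothetical name chosen for this illustration.

```r
rNB_geom <- function(n, r, p) {
  # n * r geometric draws, arranged with one row per negative binomial draw
  geom_draws <- matrix(rgeom(n * r, p), nrow = n, ncol = r)
  rowSums(geom_draws)   # total failures before the r-th success
}

set.seed(123)
nb_draws <- rNB_geom(10000, 2, 0.1)
mean(nb_draws)   # should be near r * (1 - p) / p = 18
```

To reuse rGE from the question, which counts trials rather than failures, subtract r from the sum of r calls.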

Choose specific number with probability

How can one choose a number with a specific probability p?
Say we must choose between {0, 1} and the probability p stands for choosing 1.
So when p=0.8 we choose 1 with 80% and 0 with 20%.
Is there a simple solution in R for this?
Take a look at the sample function.
> set.seed(1)
> sample(c(0,1), size=10, replace=TRUE, prob=c(0.2,0.8))
[1] 1 1 1 0 1 0 0 1 1 1
From the helpfile you can read:
sample takes a sample of the specified size from the elements of x using either with or without replacement.
and the argument prob in sample acts as ...
A vector of probability weights for obtaining the elements of the vector being sampled.
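An equivalent way to pick 1 with probability p, shown here only as a sketch, is to compare uniform random numbers against p. The variable names are illustrative.

```r
set.seed(1)
p <- 0.8
n <- 100000

# runif(n) < p is TRUE with probability p; as.integer() maps TRUE/FALSE to 1/0
picks <- as.integer(runif(n) < p)

prop_ones <- mean(picks)
prop_ones   # close to 0.8 for large n
```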

Random Pareto distribution in R with 30% of values being <= specified amount

Let me begin by saying this is a class assignment for an intro to R course.
First, in VGAM why are there dparetoI, ParetoI, pparetoI, qparetoI & rparetoI?
Are they not the same things?
My problem:
I would like to generate 50 random numbers in a pareto distribution.
I would like the range to be 1 – 60 but I also need to have 30% of the values <= 4.
Using VGAM I have tried a variety of functions and combinations of pareto from what I could find in documentation as well as a few things online.
I experimented with fit, quantiles and forcing a sequence from examples I found but I'm new and didn't make much sense of it.
I’ve been using this:
alpha <- 1 # location
k <- 2 # shape
mySteps <- rpareto(50,alpha,k)
range(mySteps)
str(mySteps[mySteps <= 4])
After enough iterations, the range will be acceptable but entries <= 4 are never close.
So my questions are:
Am I using the right pareto function?
If not, can you point me in the right direction?
If so, do I just keep running it until the “right” data comes up?
Thanks for the guidance.
So reading the Wikipedia entry for Pareto Distribution, you can see that the CDF of the Pareto distribution is given by:
$F_X(x) = 1 - (x_m / x)^{\alpha}$
The CDF gives the probability that X (your random variable) < x (a given value). You want Pareto distributions where
$\Pr(X < 4) \equiv F_X(4) = 0.3$
or
$0.3 = 1 - (x_m / 4)^{\alpha}$
This defines a relation between $x_m$ and $\alpha$:
$x_m = 4 \cdot (0.7)^{1/\alpha}$
In R code:
library(VGAM)
set.seed(1)
alpha <- 1
k <- 4 * (0.7)^(1/alpha)
X <- rpareto(50,k,alpha)
quantile(X,0.3) # confirm that 30% are < 4
# 30%
# 3.891941
Plot the histogram and the distribution
hist(X, breaks=c(1:60,Inf),xlim=c(1,60))
x <- 1:60
lines(x,dpareto(x,k,alpha), col="red")
If you repeat this process for different alpha, you will get different distribution functions, but in all cases ~30% of the sample will be < 4. The reason it is only approximately 30% is that you have a finite sample size (50).
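The same CDF can also be inverted by hand, which gives a base R generator that does not need VGAM. This is a sketch of inverse transform sampling; `r_pareto_inv` and its argument names are hypothetical, and a large sample is used so that the 30% share is visible.

```r
# Solve u = 1 - (scale / x)^shape for x:  x = scale * (1 - u)^(-1 / shape)
r_pareto_inv <- function(n, scale, shape) {
  u <- runif(n)
  scale * (1 - u)^(-1 / shape)
}

set.seed(1)
shape <- 1
scale <- 4 * 0.7^(1 / shape)    # chosen so that F(4) = 0.3

x <- r_pareto_inv(100000, scale, shape)
share_below_4 <- mean(x <= 4)
share_below_4   # close to 0.3
```

Note that the smallest possible value is `scale` (2.8 here), and the distribution has no upper limit, so a strict 1 to 60 range is not something an untruncated Pareto can guarantee.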

Resources