Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.

The first difference is where the length of the output vector goes in the argument list, and the argument named size means different things in the two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (names starting with r followed by a distribution suffix) take the output length as the first argument, n, whereas sample takes it as the second argument, size. In sample, the draw is made from the values in the first argument, and size is the length of the vector to return. In rbinom, n is the length of the vector to return, while size is the number of items hypothetically drawn from a theoretical urn whose composition is determined by prob. Each returned value is the number of "ones" among those size draws. Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
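A minimal sketch of what the sample call in the question actually does, under the assumption that it is easier to see when positions are sampled instead of values. The names pool, w, pair_id, picked and per_pair are made up for illustration. The weights apply to the 20 individual elements of the pool, so nothing forces exactly one draw per (0,1) pair:

```r
set.seed(1)
x <- runif(10, 0, .5)

pool    <- rep(c(0, 1), length(x))        # 20 values: 0 1 0 1 ...
w       <- c(rbind(1 - x, x))             # 20 weights: 1-x[1], x[1], 1-x[2], x[2], ...
pair_id <- rep(seq_along(x), each = 2)    # which (0,1) pair each pool element belongs to

# Sample positions rather than values so we can see where each draw came from
picked <- sample(seq_along(pool), size = 10, prob = w, replace = FALSE)
table(pair_id[picked])  # a pair can contribute two draws, or none at all

# One independent 0/1 draw per element of x, which is what
# rbinom(length(x), size = 1, prob = x) does in a single vectorised call
per_pair <- vapply(x, function(p) sample(0:1, 1, prob = c(1 - p, p)), integer(1))
per_pair
```

The table shows how many of the ten draws came from each pair. With rbinom, or the vapply loop, every element of x contributes exactly one draw.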

The difference between the two functions is quite simple.
Think of a pack of shuffled cards, and choose a number of cards from it. That is exactly the situation that sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extracts five numbers from the vector 1:40.
In your example, you set size = 1. It means you choose only one element from the pool of possible values. If you set size = 10 you will get ten values as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
In contrast, rbinom simulates events with discrete outcomes, such as coin flips. Its prob argument is the probability of success on each trial; for a fair coin that is 0.5. Here we simulate 100 flips. If you suspect the coin is loaded to favor one outcome, you can simulate that by setting the probability to 0.8, as in the second call below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81
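To connect the two functions: when sample is used with replace = TRUE and one weight per value, it draws from the same distribution as rbinom with size = 1. A rough check, where the sample size of 10000 and the names via_sample and via_rbinom are arbitrary choices for illustration:

```r
set.seed(123)
n <- 10000

# 0 with weight 0.2, 1 with weight 0.8; weights follow the order of c(0, 1)
via_sample <- sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8))

# Bernoulli draws with success probability 0.8
via_rbinom <- rbinom(n, size = 1, prob = 0.8)

# Both proportions of ones should be close to 0.8
c(sample = mean(via_sample), rbinom = mean(via_rbinom))
```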

Related

How to generate a series of number by function of sample in R, with given different probability in each try?

For example, I have a vector of probabilities:
myprob <- c(0.58, 0.51, 0.48, 0.46, 0.62)
I want to sample a series of 1s and 0s, where each draw uses the probabilities c(1-myprob, myprob) for its own position.
That is, for the first number in the series the function samples 1 and 0 by (0.42, 0.58), for the second by (0.49, 0.51), and so on.
How can I generate the 5 numbers with sample?
The syntax
Y <- sample(c(1,0), 1, replace=F, prob=c(1-myprob, myprob))
gives an incorrect number of probabilities, and would return only 1 number anyway;
while the syntax
Y <- sample(c(1,0), 5, replace=F, prob=c(1-myprob, myprob))
seems to use only 0.62 (I am not sure, but the results do not look correct at all).
Thanks in advance for any reply!
If myprob is the probability of drawing 1 for each iteration, then you can use rbinom, with n = 5 and size = 1 (5 iterations of a 1-0 draw).
set.seed(2)
rbinom(n = 5, size = 1, prob = myprob)
[1] 1 0 1 0 0
Maël already proposed a great solution sampling from a binomial distribution. There are probably many more alternatives and I just wanted to suggest two of them:
runif()
as.integer(runif(5) < myprob)
This first generates 5 uniformly distributed random numbers between 0 and 1, then compares that vector against myprob and converts the logical values TRUE/FALSE to 1/0. A uniform draw falls below myprob[i] with probability myprob[i], so each element is 1 with the intended probability.
vapply(sample())
vapply(myprob, function(p) sample(0:1, 1, prob = c(1-p, p)), integer(1))
This is what you may have been looking for in the first place. It runs sample() once for each value of myprob (as p), drawing 1 with probability p and 0 with probability 1-p, and returns the 5 draws as a vector.
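A quick empirical check can confirm that each position really uses its own probability. This is only a sketch: the repetition count n_rep and the matrix names are arbitrary, and it relies on rbinom and the comparison operator recycling myprob across the draws.

```r
set.seed(42)
myprob <- c(0.58, 0.51, 0.48, 0.46, 0.62)
n_rep  <- 20000  # hypothetical number of repeated series

# Each row is one series of 5 draws; filling by row means column j uses myprob[j]
draws_rbinom <- matrix(rbinom(n_rep * 5, size = 1, prob = myprob),
                       ncol = 5, byrow = TRUE)
draws_runif  <- matrix(as.integer(runif(n_rep * 5) < myprob),
                       ncol = 5, byrow = TRUE)

# Column-wise proportions of ones should be close to myprob for both methods
rbind(target = myprob,
      rbinom = colMeans(draws_rbinom),
      runif  = colMeans(draws_runif))
```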

Generate random numbers with rbinom but exclude 0s from the range

I need to generate random numbers with rbinom but I need to exclude 0 within the range.
How can I do it?
I would like something similar to:
k <- seq(1, 6, by = 1)
binom_pdf = dbinom(k, 322, 0.1, log = FALSE)
but I need to get all the relative dataset, because if I do the following:
binom_ran = rbinom(100, 322, 0.1)
I get values from 0 to 100.
Is there any way I can get around this?
Thanks
Let's suppose that we have the fixed parameters:
n: number of generated values
s: the size of the experiment
p: the probability of a success
# Generate initial values
U <- rbinom(n, s, p)
# Number and positions of zero values
k <- sum(U == 0)
which.k <- which(U == 0)
# While there is still a zero, generate replacements for those positions
while (k != 0) {
  U[which.k] <- rbinom(k, s, p)
  k <- sum(U == 0)
  which.k <- which(U == 0)
  # Print how many zeroes are still there
  print(k)
}
# Print U (without zeroes)
U
In addition to the hit-and-miss approach, if you want to sample from the conditional distribution of a binomial given that the number of successes is at least one, you can compute the conditional distribution and then sample from it directly.
If X is binomial with parameters n and p, then P(X = 0) = (1-p)^n, so for x = 1, ..., n
P(X = x | X > 0) = P(X = x)/(1-(1-p)^n)
Hence the following function will work:
rcond.binom <- function(k,n,p){
  # conditional probabilities of 1..n successes given at least one success
  probs <- dbinom(1:n,n,p)/(1-(1-p)^n)
  sample(1:n,k,replace = TRUE,prob = probs)
}
If you are going to call the above function numerous times with the same n and p then you can just precompute the vector probs and simply use the last line of the function whenever you need it.
I haven't benchmarked it, but I suspect that the hit-and-miss approach is preferable when k is small, p is not too close to 0, and n is large, whereas for larger k, p closer to 0, and smaller n the function above might be preferable.
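As a sanity check on the conditional-distribution idea, the sketch below uses a hypothetical helper rbinom_nonzero (not from the answers above) with arbitrary example values k = 1000, n = 5, p = 0.1. It confirms that the conditional probabilities sum to 1 and that no zero is ever drawn:

```r
# Hypothetical helper: k draws from Binomial(n, p) conditioned on at least one success
rbinom_nonzero <- function(k, n, p) {
  support <- seq_len(n)
  # P(X = x | X > 0) = P(X = x) / (1 - P(X = 0))
  probs <- dbinom(support, n, p) / (1 - dbinom(0, n, p))
  support[sample.int(n, k, replace = TRUE, prob = probs)]
}

set.seed(1)
n <- 5
p <- 0.1
cond_probs <- dbinom(1:n, n, p) / (1 - dbinom(0, n, p))
sum(cond_probs)            # should equal 1 up to floating-point error

draws <- rbinom_nonzero(1000, n, p)
range(draws)               # the minimum is never 0
```

Indexing support with sample.int avoids the special case where sample() treats a single number as the range 1:x.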

R - rbinom; what does the probability of success define if there is N number of observation?

In R, you can generate data from a binomial distribution using rbinom. For example, if you do
rbinom(400, 1, 0.2)
it generates 400 values of 0 or 1, where each value is 1 with probability 0.2. So the second argument is the number of trials, but I don't know exactly what that means. What is the number of trials? If I set it to 1, I see values of 0 or 1, and if I set it to N, I see values from 0 to N.
The size is the total number of trials, of which size*prob are expected to be successes.
rbinom(400, size = 10, prob = 0.2)
gives the results of 400 runs of 10 coin flips each, returning the number of successes in each run.
So, rbinom(1, 10, 0.2) has the same distribution, and therefore the same expected value, as sum(rbinom(10, 1, 0.2)).
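A small simulation makes the equivalence concrete. This is only a sketch; the repetition count n_rep and the names one_call and summed are arbitrary choices for illustration.

```r
set.seed(7)
n_rep <- 10000

# Number of successes in 10 trials, drawn directly
one_call <- rbinom(n_rep, size = 10, prob = 0.2)

# The same quantity built from 10 single-trial (0/1) draws
summed <- replicate(n_rep, sum(rbinom(10, size = 1, prob = 0.2)))

# Both means should be close to size * prob = 10 * 0.2 = 2
c(one_call = mean(one_call), summed = mean(summed))
```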

Find first increasing value in vector

I draw a random sample from a uniform distribution with
u <- runif (1000,0,1)
Now I want to calculate the value of this random variable
N = min_n {n : u_n > u_{n-1}}
Edit
Let's say I draw a random sample of size 10.
So I have u = (u_1, u_2, u_3, ..., u_10). Now I want to find the minimum n for which u_n > u_{n-1}.
If you take the differences (using diff), you're looking for where the difference is greater than 0. We search for the first time that happens:
u <- c(.5, .4, .3, .6)
min(which(diff(u) > 0))
This gives us 3, which is close to what we want but not exactly: diff(u)[i] is u[i+1] - u[i], so a result of 1 would mean the second element is larger than the first. What we really want is to add 1 to the result:
min(which(diff(u) > 0)) + 1
which gives 4 here. This will give a warning if your sequence never increases, though, since min() then has no values to work with. We could code in some tests and decide on the appropriate output in that case, but I'll leave that as an exercise for the reader.
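One possible way to handle the no-increase case is sketched below. The function name first_increase and the choice to return NA are assumptions made for illustration, not part of the original answer:

```r
# Hypothetical helper: index n of the first element with u[n] > u[n-1],
# or NA if the sequence never increases
first_increase <- function(u) {
  idx <- which(diff(u) > 0)
  if (length(idx) == 0) return(NA_integer_)
  idx[1] + 1L
}

first_increase(c(.5, .4, .3, .6))  # 4
first_increase(c(.5, .4, .3))      # NA, no warning
```

Since which() returns indices in increasing order, idx[1] is already the minimum, so min() and its warning on empty input are avoided.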

increase the number of defaulters in a sample

I have a banking dataset which has 5% defaulters and the rest are good( non-defaulters).
I want to create a sample which has 30% defaulters , 70% non-defaulters.
Assuming my dataset is data and it has a column named "default" signifying 0 or 1, how do i get a sample with 30% default, 70% non-default given that my original dataset has only 5% default.
Can some one please provide the R code. That would be great.
I tried the following to get 100 random samples with replacement
data[sample(1:nrow(data),size=100,replace=TRUE),]
But how do i ensure that I get that the split is 30%,70%?
sample has an argument prob, a vector of probability weights for the elements of the vector being sampled, given in the same order as those elements. So to draw 0/1 labels with 30% ones you could pass prob=c(0.7,0.3) to sample.
For example
sample(0:1, 100, replace=TRUE, prob=c(0.7,0.3))
Assume df is your data frame and default is the column indicating who defaults (coded 0/1 or FALSE/TRUE).
To sample without replacement:
df[c(sample(which(df$default == 1),30), sample(which(df$default == 0),70)),]
To sample with replacement (i.e., possibly duplicating records):
df[c(sample(which(df$default == 1),30,TRUE), sample(which(df$default == 0),70,TRUE)),]
Alternatively, if you don't want to specify an exact number of defaulters and non-defaulters, you can specify a sampling probability for each row:
set.seed(1)
df <- data.frame(default=rbinom(250,1,.5), y=rnorm(250))
n <- 100 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 61 39
n <- 150 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 97 53
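For an exact 30/70 split from data with roughly 5% defaulters, a stratified draw can be sketched as follows. The simulated data frame, the sample size of 100 and the helper pick are all assumptions for illustration; replacement is switched on only when a group has too few rows.

```r
set.seed(1)
# Hypothetical data standing in for the banking dataset: about 5% defaulters
data <- data.frame(default = rbinom(2000, 1, 0.05), y = rnorm(2000))

n_total <- 100
n_def   <- round(0.30 * n_total)   # 30 defaulters
n_good  <- n_total - n_def         # 70 non-defaulters

def_rows  <- which(data$default == 1)
good_rows <- which(data$default == 0)

# Draw k row numbers from rows, with replacement only if rows is too short
pick <- function(rows, k) {
  rows[sample.int(length(rows), k, replace = length(rows) < k)]
}

smp <- data[c(pick(def_rows, n_def), pick(good_rows, n_good)), ]
table(smp$default)
```

Drawing the two groups separately fixes the counts exactly, whereas weighting rows with prob only controls the split on average.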
