R: select a subset based on probability - r

I'm new to R. I have a normal distribution.
n <- rnorm(1000, mean=10, sd=2)
As an exercise I'd like to create a subset based on a probability curve derived from the values. E.g., for values < 5 I'd like to keep a random 25% of the entries, for values > 15 I'd like to keep a random 75% of the entries, and for values between 5 and 15 I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems like what I want is the "sample" command and its "prob" option, but I'm not clear on the syntax.

For the first two subsets we may use
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), sum(idx1) * 0.25)]
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), sum(idx2) * 0.75)]
while for the third one,
idx3 <- !idx1 & !idx2
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]
where probs are the linearly interpolated probabilities for each element of n[idx3]. Then, using sapply, we draw TRUE (take) or FALSE (don't take) for each of those elements.
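A vectorized variant of the same idea, as a minimal sketch: it clamps the interpolated probability to [0.25, 0.75] and compares one uniform draw per element against it. Note that this differs from the answer above in the tails: each tail value is kept independently with probability 25% or 75%, rather than keeping an exact 25% or 75% of them. The names p, keep and subset_n are illustrative only.

```r
set.seed(1)
n <- rnorm(1000, mean = 10, sd = 2)

# Linear interpolation: 0.25 at n = 5, 0.75 at n = 15
p <- 0.25 + (n - 5) / 10 * (0.75 - 0.25)

# Clamp so values below 5 get 0.25 and values above 15 get 0.75
p <- pmin(pmax(p, 0.25), 0.75)

# One Bernoulli trial per element: keep it when the uniform draw is below p
keep <- runif(length(n)) < p
subset_n <- n[keep]

length(subset_n)
```

Because the trials are independent, the size of the subset varies from run to run; on average it is sum(p).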

The prob option in sample() gives probability weights for the elements of the vector being sampled.
https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample
So if I understood the question right, what you want is to sample only 25% of the values < 5, 75% of the values > 15, and so on.
Then you have to use the size parameter.
As the documentation says:
size
a non-negative integer giving the number of items to choose.
There you could input the fraction of the sample you want multiplied by the length of the vector being sampled.
For your last subset, where the probability runs from .25 to .75, you could use a uniform random variable from runif().
Hope this helps!
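As a small sketch of the size idea for the lower tail: the code below draws index positions with sample.int() and then subsets, because sample(x, ...) treats a single number x as 1:x, which can bite when a tail contains only one value. It assumes a larger sample than in the question (10000 draws) so that the tail is not nearly empty; low, k and ss_low are illustrative names.

```r
set.seed(1)
n <- rnorm(10000, mean = 10, sd = 2)

low <- n[n < 5]                      # values in the lower tail
k <- floor(0.25 * length(low))       # 25% of them, rounded down

# Sample positions rather than values to avoid the sample(x) scalar gotcha
ss_low <- low[sample.int(length(low), size = k)]

length(ss_low)
```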

Related

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so I could simply transform it by adding 50 and have a range quite similar to that of my first variable.
Does anyone of you know a way to generate this kind of - more or less -meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (that fall between -.5 and +.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of a 'linear transformation' as recommended by @gregor-de-cillia.
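The reason a linear transformation is safe is that Pearson correlation does not change when a variable is shifted or multiplied by a positive constant. A minimal sketch, assuming a target mean of 50,000 and a target standard deviation of 15,000 (both picked only for illustration), using the residual-based construction from the linked Cross Validated thread:

```r
set.seed(42)
var1 <- rnorm(100, 50, 10)
rho <- -0.78

# Residuals of random noise regressed on var1 are uncorrelated with var1
noise <- rnorm(length(var1))
res <- residuals(lm(noise ~ var1))

# Combine var1 and the residuals so the sample correlation equals rho
var2_raw <- rho * sd(res) * var1 + sqrt(1 - rho^2) * sd(var1) * res

# Linear rescaling to the assumed target mean and sd; correlation is unchanged
var2 <- 50000 + 15000 * as.numeric(scale(var2_raw))

cor(var1, var2)
range(var2)
```

The rescaled values are not guaranteed to stay inside 0 to 100,000, so it is still worth checking range(var2) and adjusting the assumed standard deviation if needed.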

Simulate mixture data with a different mixed dependency structure between each two variables?

I would like to simulate mixture data, say 3-dimensional data. I would like to have 2 different components between each pair of variables.
That is, simulate mixture data (V1 and V2) where the dependency between them is a mixture of two different normal components, and then, between V2 and V3, another two normal components. So I will have 3D data where the dependence between the first and second variables is a mixture of two normals, and the dependence between the second and third variables is a mixture of another two components.
Another way to explain my question:
Suppose I would like to generate a mixture data as follows:
1- 0.3 normal(0.5,1) + 0.7 normal(2,4) # here I get bivariate mixture data generated from two different normals (the two components of the mixture model); the mixture weights sum to 1.
Then, I would like to get another variable as follows:
2- 0.5 normal(2,4) + 0.5 normal(2,6) # normal(2,4) is the second component from the first simulation
So here I get 3D simulated mixture data, where V1 and V2 are generated by two different mixture components, and V2 and V3 are generated by another two mixture components.
This is how I generate the data in R (I believe it does not generate bivariate data):
N <- 100000
#Sample N random uniforms U
U <- runif(N)
#Variable to store the samples from the mixture distribution
rand.samples <- rep(NA,N)
#Sampling from the mixture
for(i in 1:N) {
  if(U[i]<.3) {
    rand.samples[i] <- rnorm(1,1,3)
  } else {
    rand.samples[i] <- rnorm(1,2,5)
  }
}
So if we generate bivariate mixture data (two variables), how can we extend this to 4 or 5 variables, where V1 and V2 are generated from two different normals (the dependency structure between them is a mixture of two normals), and then V3 is generated from another, different normal and combined with V2? That is, when we plot V2 ~ V3 we will find that the dependency structure between them is a mixture of two normals, and so on.
I am not really sure I have correctly understood the question but I will give it a try. You have 3 distributions D1, D2 and D3. From these three distributions you would like to create variables that use 2 out of those 3 but not the same ones.
Since I do not know how the distributions should be combined, I used flags drawn from the binomial distribution (a vector of length 200 with 0s and 1s) to determine from which distribution each value will be picked (you can change that if that is not how you want it done).
D1 = rnorm(200,2,1)
D2 = rnorm(200,3,1)
D3= rnorm(200,1.5,2)
In order to create the mixed distribution we can use the rbinom function to create a vector of 1s and 0s according to a selected probability. This is a way to have some values from both distributions.
var_1_flag <- rbinom(200, size=1, prob = 0.3)
var_1 <- var_1_flag*D1 + (1 - var_1_flag)*D2
var_2_flag <- rbinom(200, size=1, prob = 0.7)
var_2 <- var_2_flag*D2 + (1 - var_2_flag)*D3
var_3_flag <- rbinom(200, size=1, prob = 0.6)
var_3 <- var_3_flag*D1 + (1 - var_3_flag)*D3
In order to see which values come from which distribution you can do the following:
var_1[var_1_flag == 1] # values in the mixed distribution that come from the first distribution (D1)
var_1[var_1_flag == 0] # values in the mixed distribution that come from the second distribution (D2)
Since I found this a bit manual and I am guessing you might want to change the variables, you might want to use the function below to get the same results
create_distr <- function(observations, mean1, sd1, mean2, sd2, flag_prob) {
  # flag is 1 where the first normal is used and 0 where the second is used
  flag <- rbinom(observations, size = 1, prob = flag_prob)
  flag * rnorm(observations, mean1, sd1) + (1 - flag) * rnorm(observations, mean2, sd2)
}
var_1 <- create_distr(200, 2, 1, 3, 1, 0.3)
var_2 <- create_distr(200, 3, 1, 1.5, 2, 0.7)
var_3 <- create_distr(200, 2, 1, 1.5, 2, 0.6)
If you would like to have more than two variables (distributions) to the mix you could extend the code you have provided as follows:
N <- 100000
#Sample N random uniforms U
U <- runif(N)
#Variable to store the samples from the mixture distribution
rand.samples <- rep(NA,N)
for(i in 1:N) {
  if(U[i] < 0.3) {
    rand.samples[i] <- rnorm(1,1,3)
  } else if (U[i] < 0.5){
    rand.samples[i] <- rnorm(1,2,5)
  } else if (U[i] < 0.8) {
    rand.samples[i] <- rnorm(1,5,2)
  } else {
    rand.samples[i] <- rt(1, 2)
  }
}
This way every element is taken one at a time from each distribution. If you want to have the same result but without taking each element one at a time you can do the following:
N <- 100000
#Sample N random uniforms U
U <- runif(N)
#Variable to store the samples from the mixture distribution
rand.samples <- rep(NA,N)
D1 = rnorm(N,1,3)
D2 = rnorm(N,2,5)
D3= rnorm(N,5,2)
D4 = rt(N, 2)
rand.samples <- c(D1[U < 0.3], D2[U >= 0.3 & U < 0.5], D3[U >= 0.5 & U < 0.8], D4[U >= 0.8])
Which corresponds to 0.3*normal(1,3) + 0.2*normal(2,5) + 0.3*normal(5,2) + 0.2*student(2 degrees of freedom)
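Another vectorized way to sample from a mixture, sketched here for normal components only: first draw a component label for every observation, then let rnorm() pick the mean and standard deviation per observation. Unlike concatenating the pieces, this keeps the draws in their original order. The three components and their weights below are assumptions chosen for illustration, not the four-component mixture above.

```r
set.seed(1)
N <- 100000

# Illustrative mixture: 0.3*normal(1,3) + 0.2*normal(2,5) + 0.5*normal(5,2)
weights <- c(0.3, 0.2, 0.5)
means   <- c(1, 2, 5)
sds     <- c(3, 5, 2)

# Component label (1, 2 or 3) for each observation
comp <- sample(seq_along(weights), N, replace = TRUE, prob = weights)

# rnorm is vectorized over mean and sd, so each draw uses its own component
rand.samples <- rnorm(N, mean = means[comp], sd = sds[comp])

prop.table(table(comp))
```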
If you want to create two mixtures, but in the second keep the same values from the normal distribution you can do the following:
mixture_1 <- c(D1[U < 0.3], D2[U >= 0.3 ])
mixture_2 <- c(D1[U < 0.3], D3[U >= 0.3])
This will use the exact same elements from normal(1,3) in both mixtures. The trick is to not regenerate rnorm(N,1,3) every time you use it. In both cases roughly 30% of the sample comes from the first normal (D1) and roughly 70% from the second distribution. For example:
set.seed(1)
N <- 100000
U <- runif(N)
> prop.table(table(U < 0.3))
FALSE TRUE
0.6985 0.3015
About 30% of the values in the U vector are below 0.3.

increase the number of defaulters in a sample

I have a banking dataset which has 5% defaulters and the rest are good( non-defaulters).
I want to create a sample which has 30% defaulters , 70% non-defaulters.
Assuming my dataset is data and it has a column named "default" signifying 0 or 1, how do i get a sample with 30% default, 70% non-default given that my original dataset has only 5% default.
Can some one please provide the R code. That would be great.
I tried the following to get 100 random samples with replacement
data[sample(1:nrow(data),size=100,replace=TRUE),]
But how do I ensure that the split is 30%/70%?
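sample has an option prob that represents a vector of probability weights for obtaining the elements of the vector being sampled. The weights follow the order of the sampled values, so to draw 1 (default) with probability 0.3 and 0 (non-default) with probability 0.7 you could use prob=c(0.7,0.3) as a parameter to sample.
For example
sample(0:1, 100, replace=TRUE, prob=c(0.7,0.3))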
Assume df is your dataframe and default is the column indicating who defaults.
To sample without replacement:
df[c(sample(which(df$default == 1),30), sample(which(df$default == 0),70)),]
To sample with replacement (i.e., possibly duplicating records):
df[c(sample(which(df$default == 1),30,TRUE), sample(which(df$default == 0),70,TRUE)),]
Alternatively, if you don't want to specify an exact number of defaulters and non-defaulters, you can specify a sampling probability for each row:
set.seed(1)
df <- data.frame(default=rbinom(250,1,.5), y=rnorm(250))
n <- 100 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 61 39
n <- 150 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 97 53
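When the classes are as unbalanced as in the question (about 5% defaulters), per-row weights of .3 and .7 will not give a 30/70 split, because the many non-default rows dominate. One way around this, sketched below under the assumption that sampling with replacement is acceptable, is to divide each class's target share by the number of rows in that class. The data frame bank and its columns are made up for the example.

```r
set.seed(1)

# Hypothetical data with roughly 5% defaulters
bank <- data.frame(default = rbinom(5000, 1, 0.05), y = rnorm(5000))

n_def  <- sum(bank$default == 1)
n_good <- sum(bank$default == 0)

# Each class's rows share that class's target probability equally
w <- ifelse(bank$default == 1, 0.3 / n_def, 0.7 / n_good)

# With replacement, every draw is a defaulter with probability 0.3
idx <- sample(nrow(bank), size = 10000, replace = TRUE, prob = w)
resampled <- bank[idx, ]

mean(resampled$default)
```

The resulting share of defaulters is close to 30% on average but not exactly 30%; for an exact split, draw fixed counts from each class as in the which() approach above.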

generate random integers between two values with a given probability using R

I have the following four number sets:
A=[1,207];
B=[208,386];
C=[387,486];
D=[487,586].
I need to generate 20000 random numbers between 1 and 586 in which the probability that the generated number belongs to A is 1/2 and to B,C,D is 1/6.
in which way I can do this using R?
You can directly use sample, more specifically the prob argument. Just divide the probability over all 586 numbers: each number in category A gets weight 0.5/207, etc.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
L <- sapply(list(A, B, C, D), length)
x <- sample(c(A, B, C, D),
            size = 20000,
            prob = rep(c(1/2, 1/6, 1/6, 1/6) / L, L),
            replace = TRUE)
I would say use the Roulette selection method. I will try to give a brief explanation here.
Take a line of, say, length 1 unit. Now break it in proportion to the probability values. So in our case, the first piece will be of length 1/2 and the next three pieces will each be of length 1/6. Now sample a number between 0 and 1 from the uniform distribution. As all numbers have the same probability of occurring, the probability that the sampled number falls in a piece is equal to the length of that piece. Hence, whichever piece the number belongs to, sample from that vector. (I will give you the R code below; you can run it for a huge number of draws to check whether what I am saying is true. I might not be doing a good job of explaining it here.)
It is called Roulette selection because another analogy for the same situation is to take a circle and split it into sectors where the angle of each sector is proportional to the probability values. Now sample a number again from the uniform distribution, see which sector it falls in, and sample uniformly from the corresponding vector.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
cumList <- list(A,B,C,D)
probVec <- c(1/2,1/6,1/6,1/6)
cumProbVec <- cumsum(probVec)
ret <- NULL
for( i in 1:20000){
  rand <- runif(1)
  whichVec <- which(rand < cumProbVec)[1]
  ret <- c(ret,sample(cumList[[whichVec]],1))
}
#Testing the results
length(which(ret %in% A)) # Almost 1/2*20000 of the values
length(which(ret %in% B)) # Almost 1/6*20000 of the values
length(which(ret %in% C)) # Almost 1/6*20000 of the values
length(which(ret %in% D)) # Almost 1/6*20000 of the values
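The same roulette idea can be written without a loop, which avoids growing ret one element at a time. This is only a sketch: findInterval() locates the piece each uniform draw falls into, and a second uniform draw picks a number inside that piece. It relies on each set being a run of consecutive integers, as in the question; the names sets, grp, lo and size are illustrative.

```r
set.seed(1)
sets <- list(A = 1:207, B = 208:386, C = 387:486, D = 487:586)
cum_prob <- cumsum(c(1/2, 1/6, 1/6, 1/6))

u <- runif(20000)

# Only the first three cut points are needed; draws above them land in piece 4
grp <- findInterval(u, cum_prob[1:3]) + 1L

lo   <- unname(sapply(sets, min))   # first value of each set
size <- unname(lengths(sets))       # number of values in each set

# Uniform pick inside the chosen set: offset between 0 and size - 1
x <- lo[grp] + floor(runif(20000) * size[grp])

prop.table(table(grp))
```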

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector starting at 0 and ending at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
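A related base-R option is findInterval(), which does a binary search of each uniform draw against the CDF vector. The sketch below follows the algorithm in the question (choose 1 when u < cdf[1], and so on). One thing to be aware of with cut(u, cdf) is that draws below cdf[1] fall outside the breaks and come back as NA, since the example cdf has no leading 0; findInterval() returns 0 for those, so adding 1 gives the intended index.

```r
set.seed(1)
cdf <- cumsum(runif(10000, 0, 0.1))
cdf <- cdf / max(cdf)

N <- 1e6
u <- runif(N)

# findInterval counts how many cdf entries are <= u; add 1 for a 1-based index
x <- findInterval(u, cdf) + 1L

range(x)
```

For a small number of draws this agrees with the colSums(sapply(u, ">", cdf)) approach from the question, shifted by one because that version counts from 0.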
If you have a finite number of possible values then you can use findInterval or cut or, better, sample as mentioned by @Hemmo.
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.) then here is an algorithm that will work (this will also work with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution values, subtracting them from the vector of uniforms; the random value is the iteration at which the value goes negative. This is easier to see with an example. This generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares it to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while( any(tmp > 0) ) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)
