Generate random integers between two values with a given probability using R

I have the following four number sets:
A=[1,207];
B=[208,386];
C=[387,486];
D=[487,586].
I need to generate 20000 random numbers between 1 and 586 in which the probability that the generated number belongs to A is 1/2 and to B,C,D is 1/6.
How can I do this in R?

You can directly use sample, more specifically its prob argument. Just divide the probability over all 586 numbers: each value in category A gets a weight of 0.5/207, and so on.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
L <- sapply(list(A, B, C, D), length)
x <- sample(c(A, B, C, D),
            size = 20000,
            prob = rep(c(1/2, 1/6, 1/6, 1/6) / L, L),
            replace = TRUE)
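A quick sanity check (not part of the original answer): the empirical proportions should land near 1/2, 1/6, 1/6, 1/6.
sapply(list(A, B, C, D), function(s) mean(x %in% s))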

I would say use the Roulette selection method. I will try to give a brief explanation here.
Take a line of length 1 unit and break it into pieces proportional to the probability values. In our case the first piece has length 1/2 and the next three pieces each have length 1/6. Now sample a number between 0 and 1 from the uniform distribution. Because every value is equally likely, the probability that the sampled number lands in a given piece equals that piece's length. Whichever piece the number falls into, sample from the corresponding vector. (The R code below lets you run this for a large number of draws and check that it behaves as claimed; I might not be doing a good job of explaining it here.)
It is called roulette selection because of another analogy for the same situation: take a circle and split it into sectors whose angles are proportional to the probability values. Sample a number from the uniform distribution, see which sector it falls in, and sample from the corresponding vector.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
cumList <- list(A,B,C,D)
probVec <- c(1/2,1/6,1/6,1/6)
cumProbVec <- cumsum(probVec)
ret <- NULL
for (i in 1:20000){
  rand <- runif(1)
  whichVec <- which(rand < cumProbVec)[1]
  ret <- c(ret, sample(cumList[[whichVec]], 1))
}
#Testing the results
length(which(ret %in% A)) # Almost 1/2*20000 of the values
length(which(ret %in% B)) # Almost 1/6*20000 of the values
length(which(ret %in% C)) # Almost 1/6*20000 of the values
length(which(ret %in% D)) # Almost 1/6*20000 of the values
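Growing ret with c() inside the loop is slow for 20,000 draws. The same roulette idea can be vectorized; here is a sketch (not part of the original answer; rand, piece, and ret2 are illustrative names): draw all the uniforms at once, map each one to its piece with findInterval, then sample within each piece in a single call.
rand  <- runif(20000)
piece <- pmin(findInterval(rand, cumProbVec) + 1, length(cumList))  # which set each draw falls in; pmin guards against floating-point error in the last cumulative probability
ret2  <- numeric(20000)
for (j in seq_along(cumList)) {
  idx <- which(piece == j)
  ret2[idx] <- sample(cumList[[j]], length(idx), replace = TRUE)
}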


R code Gaussian mixture -- numerical expression has 2 elements: only the first used

I'm trying to create a Gaussian Mix function according to these parameters:
For each sample, roll a die with k sides
If the j-th side appears from the roll, draw a sample from Normal(mu_j, sd_j), where mu_j and sd_j are the mean and standard deviation for the j-th Normal distribution respectively. This means you should have k different Normal distributions to choose from. Note that mu_j is the mathematical form of referring to the j-th element in a vector called mus.
The resulting sample from this Normal is then from a Gaussian Mixture.
Where:
n, an integer that represents the number of independent samples you want from this random variable
mus, a numeric vector with length k
sds, a numeric vector with length k
prob, a numeric vector with length k that indicates the probability of choosing the different Gaussians. This should default to NULL.
This is what I came up with so far:
n <- c(1)
mus <- c()
sds <- c()
prob <- c()
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  for(i in 1:seq_len(n)){
    if(is.null(prob)){
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus)), n, replace=TRUE)
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }else{
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus), n, replace=TRUE, p=prob))
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }
  }
  return(avg)
}
rgaussmix(2, 1:3, 1:3)
It seems to match most of the requirements, but it keeps giving me the following error:
numerical expression has 2 elements: only the first used
number of items to replace is not a multiple of replacement length
I've tried looking at the lengths of multiple variables, but I can't seem to figure out where the error is coming from!
Could someone please help me?
If you do seq_len(2) it gives you:
[1] 1 2
And you cannot do 1:(1:2); it doesn't make sense.
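A minimal illustration (not part of the original answer) of where the first warning in the question comes from:
1:seq_len(2)
# [1] 1
# Warning message:
# In 1:seq_len(2) : numerical expression has 2 elements: only the first used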
Also, you can avoid the loop in your code by sampling all the draws you need at once. For example, if you do:
rnorm(3,c(0,10,20),1)
[1] -0.507961 8.568335 20.279245
it gives you the 1st sample from the 1st mean, the 2nd sample from the 2nd mean, and so on. So you can simplify your function to:
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  if(is.null(prob)){
    prob <- rep(1/length(mus), length(mus))
  }
  rolls <- sample(length(mus), n, replace = TRUE, prob = prob)
  avg <- rnorm(n, mean = mus[rolls], sd = sds[rolls])
  avg
}
You can plot the results:
plot(density(rgaussmix(10000,c(0,5,10),c(1,1,1))),main="mixture of 0,5,10")
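The prob argument then lets you weight the components unequally; a short sketch reusing the function above (the 0.6/0.3/0.1 weights and the name y are just for illustration):
set.seed(1)
y <- rgaussmix(10000, mus = c(0, 5, 10), sds = c(1, 1, 1), prob = c(0.6, 0.3, 0.1))
plot(density(y), main = "mixture weighted 0.6 / 0.3 / 0.1")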

Faster way to generate large list of vectors from permuted datasets [R]

Setup
For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and, for each column, calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question
This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code
B=2000
n.row=200; n.col=8000
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col),nrow=n.row)
perm.KS.list = NULL
for (b in 1:B){
  # Permute the rows
  perm.dat.tmp = samp.dat[sample(nrow(samp.dat)),]
  # Compute the permutation-based test statistics
  perm.KS.list[[b]] = apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}
#Modified KS-test function (from base package)
ks.test.stat <- function(x, y){
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]  # exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
The 1:B loop has several places to optimize, but I agree that the real time sink is that inner function. Because you're simulating well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values. This obviates the is.na() adjustments.
The two sides (i.e., x and y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep the column together. The balanced sides also permit simplifications such as removing the ifelse() clause: it produces a bunch of 0/1s, which can be rescaled to -1/1s with integer arithmetic.
The reduced function below saves about 25% of the time. I also use integers, instead of doubles, inside cumsum().
ks.test.stat.balanced <- function(w){
  n <- as.integer(length(w) * .5)
  # z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z <- cumsum((order(w) <= n)*2L - 1L) / n
  # z <- z[c(which(diff(sort(w)) != 0), n + n)]  # exclude ties
  return( max(abs(z)) )
}
Ties shouldn't occur often with your Gaussian RNG, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
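For reference, a sketch of how the simulation loop from the question would call the balanced version (each permuted column is passed whole; the first half is group 1, the second half group 2):
perm.KS.list <- vector("list", B)
for (b in 1:B){
  perm.dat.tmp <- samp.dat[sample(nrow(samp.dat)), ]
  perm.KS.list[[b]] <- apply(perm.dat.tmp, 2, ks.test.stat.balanced)
}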
Edit: in case of an unbalanced simulation, I'd recommend you:
still keep out the is.na() handling,
still pass w,
still keep as much as possible in integer rather than numeric, but
now include arguments n1 and n2 for the two group sizes.
Also, experiment with precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to pull calculations out of an inner loop so they occur less frequently.

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but the values I get are very small, and I cannot transform them so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so I could simply transform it by adding 50 and get a range quite similar to my first variable.
Does anyone know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert the values in A to have the new desired mean of 50,000 and an inverse relationship (i.e., subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This alone gives r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside the range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
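A quick check of why this works (a sketch using var1 and var2 from the question's code; var2_scaled is just an illustrative name): correlation is unchanged by a positive linear transformation, so the rescaled variable keeps r = -0.78 while moving into the desired range.
var2_scaled <- var2 * 100000 + 50000
cor(var1, var2_scaled)   # identical to cor(var1, var2)
range(var2_scaled)       # within 0 to 100,000 as long as var2 stays within (-0.5, 0.5)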

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and k-dimensional parameter vectors mu and sig, A is created as A[i,j] = dnorm(x[i], mu[j], sig[j]). The following code works fine for small sizes like n=40, k=4, but slows down significantly when n is around 10^6 and k is about n^(1/3).
I am running a simulation experiment to verify bootstrap validity, so I need to compute the matrix A repeatedly, #simulations × #bootstrap times, and it becomes quite time-consuming since I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but can I speed it up further?
Thanks in advance for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
  A[,j] = dnorm(x, mu[j], sig[j])
}
Your method can be put into a function like this
A.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  A <- matrix(NA, n, k)
  for(j in 1:k) A[,j] <- dnorm(x, mu[j], sig[j])
  A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm, using suitable repetitions of x, mu, and sig. The vector z will contain the columns of the desired matrix stacked on top of each other, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  z <- dnorm(rep(x, times=k), rep(mu, each=n), rep(sig, each=n))
  B <- matrix(z, nrow=n, ncol=k)
  B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n.
The reason lies in the construct rep(..., each=...).
So, is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  sapply(1:k, function(j) dnorm(x, mu[j], sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.
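If you want to compare the three versions on your own machine, here is a sketch using the microbenchmark package (the problem sizes below are only illustrative):
library(microbenchmark)
x   <- rnorm(1e5)
mu  <- seq(-5, 5, length.out = 50)
sig <- rep(1, 50)
microbenchmark(A.fill(x, mu, sig),
               B.fill(x, mu, sig),
               C.fill(x, mu, sig),
               times = 10)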

R: Distribution of Random Samples vs. 1 Random Sample

I have a question about random sampling.
Are the two following results (A and B) statistically the same?
nobs <- 1000
A <- rt(n=nobs, df=3, ncp=0)
simulations <- 50
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3, ncp=0) ))
I thought they would be, but now I've been going back and forth.
Any help would be appreciated.
Thanks
With some small changes, you can even make them numerically equal. You only need to seed the RNG the same way before each draw and omit the ncp argument entirely (so the default central t-distribution code path is used) instead of passing ncp=0:
nobs <- 1000
set.seed(42)
A <- rt(n=nobs, df=3)
simulations <- 50
set.seed(42)
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3) ))
all.equal(A, B)
#[1] TRUE
Why don't you get equal results when you specify ncp=0?
Because then rt assumes that you actually want a non-central t-distribution, and the values are calculated as rnorm(n, ncp)/sqrt(rchisq(n, df)/df). That means when creating 1000 values at once, rnorm is called once and rchisq is called once afterwards. If you instead create 50 batches of 20 values, the calls to these RNGs alternate, so the RNG state at each rnorm and rchisq call differs from the first case.
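A short sketch (not part of the original answer) that makes this visible, using the same nobs = 1000 split into 50 batches of 20 as in the question (A2 and B2 are illustrative names): with ncp=0 spelled out, the two ways of drawing consume the RNG stream in a different order, so the seeded results no longer match element for element, even though both are valid samples from the same distribution.
set.seed(42)
A2 <- rt(n = 1000, df = 3, ncp = 0)
set.seed(42)
B2 <- unlist(lapply(rep.int(20, 50), function(y) rt(n = y, df = 3, ncp = 0)))
isTRUE(all.equal(A2, B2))
# [1] FALSE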
