Simulate mixture data with different mixture dependency structures between each pair of variables? - r

I would like to simulate mixture data, say 3-dimensional data, with two different mixture components between each pair of variables.
That is, simulate mixture data (V1 and V2) where the dependence between them is a mixture of two normal components; then, between V2 and V3, another two normal components. So I will have 3D data where the dependence between the first and second variables is a mixture of two normals, and the dependence between the second and third variables is a mixture of two different components.
Another way to explain my question:
Suppose I would like to generate mixture data as follows:
1- 0.3 normal(0.5,1) + 0.7 normal(2,4) # here I get bivariate mixture data generated from two different normals (the two components of the mixture model); the mixture weights sum to 1.
Then, I would like to get another variable as follows:
2- 0.5 normal(2,4) + 0.5 normal(2,6) # normal(2,4) is the second component from the first simulation
So here I get 3D simulated mixture data, where V1 and V2 are generated from two mixture components, and V2 and V3 are generated from another two different mixture components.
This is how I generate the data in R (I believe it does not generate bivariate data):
N <- 100000
# Sample N random uniforms U
U <- runif(N)
# Variable to store the samples from the mixture distribution
rand.samples <- rep(NA, N)
# Sampling from the mixture
for (i in 1:N) {
  if (U[i] < 0.3) {
    rand.samples[i] <- rnorm(1, 1, 3)
  } else {
    rand.samples[i] <- rnorm(1, 2, 5)
  }
}
So if we can generate bivariate mixture data (two variables), how can we extend this to 4 or 5 variables, where V1 and V2 are generated from two different normals (the dependence structure between them is a mixture of two normals), and then V3 is generated from another, different normal and combined with V2? That is, when we plot V2 ~ V3 we should find that the dependence structure between them is also a mixture of two normals, and so on.

I am not really sure I have correctly understood the question, but I will give it a try. You have 3 distributions D1, D2 and D3. From these three distributions you would like to create variables that use 2 out of those 3, but not the same ones.
Since I do not know how the distributions should be combined, I used flags drawn from the binomial distribution (a vector of length 200 containing 0s and 1s) to determine from which distribution each value is picked (you can change that if that is not how you want it done).
D1 <- rnorm(200, 2, 1)
D2 <- rnorm(200, 3, 1)
D3 <- rnorm(200, 1.5, 2)
In order to create the mixed distribution we can use the rbinom function to create a vector of 1s and 0s according to a selected probability. This is a way to have some values from both distributions.
var_1_flag <- rbinom(200, size=1, prob = 0.3)
var_1 <- var_1_flag*D1 + (1 - var_1_flag)*D2
var_2_flag <- rbinom(200, size=1, prob = 0.7)
var_2 <- var_2_flag*D2 + (1 - var_2_flag)*D3
var_3_flag <- rbinom(200, size=1, prob = 0.6)
var_3 <- var_3_flag*D1 + (1 - var_3_flag)*D3
In order to see which values come from which distribution you can do the following:
var_1[var_1_flag == 1]  # the values in the mixed distribution that come from the first distribution (D1)
var_1[var_1_flag == 0]  # the values in the mixed distribution that come from the second distribution (D2)
Since I found this a bit manual and I am guessing you might want to change the variables, you might want to use the function below to get the same results
create_distr <- function(observations, mean1, sd1, mean2, sd2, flag_prob) {
  flag <- rbinom(observations, size = 1, prob = flag_prob)
  flag * rnorm(observations, mean1, sd1) + (1 - flag) * rnorm(observations, mean2, sd2)
}
var_1 <- create_distr(200, 2, 1, 3, 1, 0.5)
var_2 <- create_distr(200, 3, 1, 1.5, 2, 0.7)
var_3 <- create_distr(200, 2, 1, 1.5, 2, 0.6)
If you would like to have more than two distributions in the mix, you could extend the code you provided as follows:
N <- 100000
# Sample N random uniforms U
U <- runif(N)
# Variable to store the samples from the mixture distribution
rand.samples <- rep(NA, N)
for (i in 1:N) {
  if (U[i] < 0.3) {
    rand.samples[i] <- rnorm(1, 1, 3)
  } else if (U[i] < 0.5) {
    rand.samples[i] <- rnorm(1, 2, 5)
  } else if (U[i] < 0.8) {
    rand.samples[i] <- rnorm(1, 5, 2)
  } else {
    rand.samples[i] <- rt(1, 2)
  }
}
This way every element is drawn one at a time from the appropriate distribution. If you want the same result without drawing each element one at a time, you can do the following:
N <- 100000
# Sample N random uniforms U
U <- runif(N)
D1 <- rnorm(N, 1, 3)
D2 <- rnorm(N, 2, 5)
D3 <- rnorm(N, 5, 2)
D4 <- rt(N, 2)
# note: this concatenates the samples component by component;
# use sample(rand.samples) afterwards if the ordering matters
rand.samples <- c(D1[U < 0.3], D2[U >= 0.3 & U < 0.5], D3[U >= 0.5 & U < 0.8], D4[U >= 0.8])
Which corresponds to 0.3*normal(1,3) + 0.2*normal(2,5) + 0.3*normal(5,2) + 0.2*student(2 degrees of freedom)
If you want to create two mixtures, but have the second keep the same values from the first normal distribution, you can do the following:
mixture_1 <- c(D1[U < 0.3], D2[U >= 0.3 ])
mixture_2 <- c(D1[U < 0.3], D3[U >= 0.3])
This will use the exact same elements from normal(1,3) in both mixtures. The trick is not to recalculate rnorm(N, 1, 3) every time you use it. In both cases the samples are composed of roughly 30% from the first normal (D1) and roughly 70% from the second distribution. For example:
set.seed(1)
N <- 100000
U <- runif(N)
prop.table(table(U < 0.3))
 FALSE   TRUE
0.6985 0.3015
Roughly 30% of the values in the U vector are below 0.3.
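Tying this back to the 3-variable structure in the question: one reading is that each consecutive pair of variables shares a latent component label, with a different dependence inside each component. A minimal sketch under that assumption (the regression-style links and all parameter values here are illustrative choices, not taken from the question):
set.seed(42)
N <- 5000
z12 <- rbinom(N, 1, 0.3)  # component labels shared by the (V1, V2) pair
V1 <- ifelse(z12 == 1, rnorm(N, 0.5, 1), rnorm(N, 2, 4))
V2 <- ifelse(z12 == 1, 0.8 * V1 + rnorm(N, 0, 0.5),   # one dependence in component 1
             -0.5 * V1 + rnorm(N, 0, 1))              # another in component 2
z23 <- rbinom(N, 1, 0.5)  # a second set of labels for the (V2, V3) pair
V3 <- ifelse(z23 == 1, 0.6 * V2 + rnorm(N, 0, 0.5),
             -0.7 * V2 + rnorm(N, 0, 1))
pairs(cbind(V1, V2, V3))  # V1 ~ V2 and V2 ~ V3 each show a two-component pattern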

Related

How to generate spatially correlated random fields of very high dimension with R

This is an extended question based on what I found here (Method #1: http://santiago.begueria.es/2010/10/generating-spatially-correlated-random-fields-with-r/) and here (Method #2: https://gist.github.com/brentp/1306786). Both sites cover the problem very well (thanks!) for relatively small dimensions (e.g., 1000x1). I am trying to generate spatially clustered binary data of much larger dimension, >= 100000x1, for example c(1,1,1,1,0,1,0,0,0,0, …, 0,0,0,0,0,0,0,0,0,0,0,0), 1000 times per case study. Here are slightly modified codes from those sites.
# Method #1
library(gstat)
dim1 <- 1000
dim2 <- 1
xy <- expand.grid(seq_len(dim1), seq_len(dim2))
colnames(xy) <- c("x", "y")
geo.model <- gstat(formula = z ~ x + y, locations = ~x + y, dummy = TRUE, beta = 0,
                   model = vgm(psill = 1, "Exp", range = dim1),  # range parameter
                   nmax = 30)  # spatial correlation model
sim.mat <- predict(geo.model, newdata = xy, nsim = 1)
sim.mat[, 3] <- ifelse(sim.mat[, 3] > quantile(sim.mat[, 3], .1), 0, 1)
plot(sim.mat[, 3])
# Method #2
# generate autocorrelated data
nLags <- 1000  # number of lags (size of region)
# uncorrelated observations
X <- rnorm(nLags)
# AR(1)-style covariance: correlation decays with distance
sigma <- diag(nLags)
corr <- .999
sigma <- corr ^ abs(row(sigma) - col(sigma))
# Y is autocorrelated
Y <- t(X %*% chol(sigma))
y <- ifelse(Y >= quantile(Y, probs = .9), 1, 0)[, 1]
plot(y)
Both methods work very well to generate binary data when dim1 is less than 10000. However, with several hundred thousand points (e.g., >= 100,000), they take a very long time or run out of memory.
For example, with nLags = 50000 in Method #2, I got an error message ("Error: cannot allocate vector of size 9.3 Gb") at the line sigma <- corr ^ abs(row(sigma) - col(sigma)).
I would like to find an efficient (time- and memory-saving) way to generate such spatially clustered binary data 1000 times (especially with dim1 >= 100000) for each of about 200 case studies.
I have thought about applying multiple probabilities in the "sample" function or a probability distribution, but I am not sure how, and it is beyond my scope.
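One route worth sketching (my own suggestion, not from the linked posts): the covariance used in Method #2, sigma[i,j] = corr^|i-j|, is exactly the covariance of a stationary AR(1) process, so the same field can be generated sequentially in O(n) time and memory, with no n-by-n matrix at all:
# AR(1) recursion: reproduces cov(Y[i], Y[j]) = corr^|i-j| exactly,
# without ever forming the covariance matrix
nLags <- 100000
corr <- 0.999
e <- rnorm(nLags)
Y <- numeric(nLags)
Y[1] <- e[1]  # stationary start: var(Y[1]) = 1
for (i in 2:nLags) {
  Y[i] <- corr * Y[i - 1] + sqrt(1 - corr^2) * e[i]  # keeps var(Y[i]) = 1
}
y <- ifelse(Y >= quantile(Y, probs = 0.9), 1, 0)
plot(y[1:5000], type = "h")  # inspect a window; the full series is heavy to plot
Up to an overall scale factor (which the quantile thresholding ignores), this matches what arima.sim(list(ar = corr), n = nLags) produces, and it repeats cheaply for the 1000 replicates.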

R: select a subset based on probability

I'm new to R. I have a normal distribution.
n <- rnorm(1000, mean=10, sd=2)
As an exercise I'd like to create a subset based on a probability curve derived from the values. E.g., for values < 5 I'd like to keep a random 25% of entries; for values > 15 I'd like to keep a random 75% of entries; and for values between 5 and 15 I'd like to linearly interpolate the selection probability between 25% and 75%. It seems like what I want is the "sample" command and its "prob" option, but I'm not clear on the syntax.
For the first two subsets we may use
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), round(sum(idx1) * 0.25))]
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), round(sum(idx2) * 0.75))]
while for the third one,
idx3 <- !idx1 & !idx2
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]
where probs are the linearly interpolated selection probabilities for each element of n[idx3]. Then, using sapply, we draw TRUE (keep) or FALSE (drop) for each of those elements.
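A fully vectorized variant of the same rule (a sketch; it keeps each element independently with its interpolated probability, so the subset sizes are random rather than exactly 25% and 75%):
keep_prob <- pmin(pmax((n - 5) / 10 * (0.75 - 0.25) + 0.25, 0.25), 0.75)  # clamp to [0.25, 0.75]
subset_n <- n[runif(length(n)) < keep_prob]  # one Bernoulli draw per element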
The prob option in sample() gives probability weights for the elements of the vector being sampled.
https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample
So if I understood the question right, you want to sample only 25% of the values < 5, 75% of the values > 15, and so on.
Then you have to use the n parameter.
As the documentation says:
n
a positive number, the number of items to choose from. See 'Details.'
There you could input the % of the sample you want, multiplied by the length of the sample vector.
For your last subset you could draw the selection probability from a uniform variable running from .25 to .75 with runif().
Hope this helps!

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (i.e. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like much more spread-out data, so I could simply transform it (for example by adding 50) and get a range quite similar to my first variable.
Does any of you know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First transform the values in A to have the new desired mean of 50,000 and an inverse relationship (i.e., subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This alone gives exactly r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values beyond that range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the "linear transformation" recommended by @gregor-de-cillia.
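To make the transformation concrete against the question's own code (a small sketch using var1 and var2 from above; a linear map changes neither the shape nor the magnitude of the correlation):
var2_rescaled <- 50 + 10 * (var2 - mean(var2)) / sd(var2)  # put var2 on roughly var1's scale
cor(var1, var2_rescaled)  # still about -0.78
range(var2_rescaled)      # now comparable to var1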

How to generate data for gaussian distributions in these 2 scenarios in R?

In "Elements of Statistical Learning" by Tibshirani, when comparing least squares/linear models and knn these 2 scenarios are stated:
Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.
The idea is that the first is better suited to least squares/linear models and the second to knn-like models (those with higher variance, from what I understand, since knn takes into account only the closest points rather than all points).
In R, how would I simulate data for both scenarios?
The end goal is to be able to reproduce both scenarios in order to prove that effectively the 1st one is better explained by the linear model than the 2nd one.
Thanks!
This could be scenario 1
library(mvtnorm)
N1 = 50
N2 = 50
K = 2
mu1 = c(-1,3)
mu2 = c(2,0)
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11,cov1,cov1,v12),nrow=2)
cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21,cov2,cov2,v22),nrow=2)
x1 = rmvnorm(N1,mu1,Sigma1)
x2 = rmvnorm(N2,mu2,Sigma2)
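A quick way to look at the two simulated classes together (just a usage sketch for the code above):
plot(rbind(x1, x2), col = rep(c("blue", "red"), times = c(N1, N2)),
     xlab = "X1", ylab = "X2")  # two Gaussian clouds with different means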
This could be a candidate for simulating from a Gaussian mixture:
BartSimpson <- function(x, n = 100){
  means <- as.matrix(sort(rnorm(10)))
  dens <- .1 * rowSums(apply(means, 1, dnorm, x = x, sd = .1))
  rBartSimpson <- c(apply(means, 1, rnorm, n = n / 10, sd = .1))
  return(list("thedensity" = dens, "draws" = rBartSimpson))
}
x <- seq(-5, 5, by = .01)
plot(x, BartSimpson(x)$thedensity, type = "l", lwd = 4, col = "yellow2", xlim = c(-4, 4), ylim = c(0, 0.6))
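Since each call to BartSimpson() draws a fresh set of random means, capture a single call if you want to compare the draws against their matching density; a small sketch:
set.seed(1)
bs <- BartSimpson(x, n = 1000)  # one call, so density and draws share the same means
hist(bs$draws, breaks = 60, freq = FALSE, col = "grey",
     main = "Draws vs. mixture density", xlab = "x")
lines(x, bs$thedensity, lwd = 2, col = "red")  # overlay the corresponding density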
In the code below, I first create the 10 different class means, and then use those means to draw random values. The code is identical for the two scenarios, but you'll have to adjust the variance within and between classes to get the results you want.
Scenario 1:
Here you want to generate 10 classes with different means (I assume the means follow a bivariate Gaussian distribution). In this scenario the variation between classes is much smaller than the variation within classes.
library(MASS)
n <- 20            # subjects per class
classes <- 10      # number of classes
mean <- 100        # mean value for all classes
var.between <- 25  # variation between classes
var.within <- 225  # variation within classes
covmatrix1 <- matrix(c(var.between, 0, 0, var.between), nrow = 2)
# covariance matrix for the class means
means <- mvrnorm(classes, c(100, 100), Sigma = covmatrix1)
# creates the means of the two variables for each class, using the between-class variance
covmatrix2 <- matrix(c(var.within, 0, 0, var.within), nrow = 2)
# covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
  temp <- mvrnorm(n, c(means[i], means[i + classes]), Sigma = covmatrix2)
  class <- c(class, rep(i, n))
  values <- c(values, temp)
}
# this loop generates data for each class based on the class means and within-class variance
valuematrix <- matrix(values, nrow = (n * classes))
data <- data.frame(class, valuematrix)
plot(data$X1, data$X2)
Alternatively, if you don't care about specifying the variance between the classes, and you don't want any correlation within classes, you can just do this:
covmatrix <- matrix(c(225, 0, 0, 225), nrow = 2)
# variance 225 in both dimensions, no covariance
values <- matrix(mvrnorm(200, c(100, 100), Sigma = covmatrix), nrow = 200)
# creates a matrix of 200 individuals with two values each
Scenario 2:
Here the only difference is that the variation between classes is larger than the variation within classes. Set var.between to around 500 and var.within to 25, and you'll see clear clustering in the scatterplot:
n <- 20            # subjects per class
classes <- 10      # number of classes
mean <- 100        # mean value for all classes
var.between <- 500 # variation between classes
var.within <- 25   # variation within classes
covmatrix1 <- matrix(c(var.between, 0, 0, var.between), nrow = 2)
# covariance matrix for the class means
means <- mvrnorm(classes, c(100, 100), Sigma = covmatrix1)
covmatrix2 <- matrix(c(var.within, 0, 0, var.within), nrow = 2)
# covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
  temp <- mvrnorm(n, c(means[i], means[i + classes]), Sigma = covmatrix2)
  class <- c(class, rep(i, n))
  values <- c(values, temp)
}
valuematrix <- matrix(values, nrow = (n * classes))
data <- data.frame(class, valuematrix)
plot(data$X1, data$X2)
The plot should confirm that the data are clustered.
Hope this helps!
With the help from both answers here I ended up using this:
mixed_dists <- function(n, n_means, var = 0.2) {
  means <- rnorm(n_means, mean = 1, sd = 2)
  values <- NULL
  class <- NULL
  for (i in 1:n_means) {
    temp <- rnorm(n / n_means, mean = means[i], sd = sqrt(var))  # sd from the var argument, treating var as a variance
    class <- c(class, rep(i, n / n_means))
    values <- c(values, temp)
  }
  return(list(values, class))
}
N <- 100
# Scenario 1: The training data in each class were generated from bivariate Gaussian distributions
# with uncorrelated components and different means.
scenario1 <- function() {
  var <- 0.5
  n_groups <- 2
  m <- mixed_dists(N, n_groups, var = var)
  x <- m[[1]]
  group <- m[[2]]
  y <- mixed_dists(N, n_groups, var = var)[[1]]
  data <- matrix(c(x, y, group), nrow = N, ncol = 3)
  colnames(data) <- c("x", "y", "group")
  data <- data.frame(data)
  plot(x = data$x, y = data$y, col = data$group)
  model <- lm(y ~ x, data = data)
  summary(model)
}
# Scenario 2: The training data in each class came from a mixture of 10
# low-variance Gaussian distributions, with individual means themselves
# distributed as Gaussian.
scenario2 <- function() {
  var <- 0.2  # low variance
  n_groups <- 10
  m <- mixed_dists(N, n_groups, var = var)
  x <- m[[1]]
  group <- m[[2]]
  y <- mixed_dists(N, n_groups, var = var)[[1]]
  data <- matrix(c(x, y, group), nrow = N, ncol = 3)
  colnames(data) <- c("x", "y", "group")
  data <- data.frame(data)
  plot(x = data$x, y = data$y, col = data$group)
  model <- lm(y ~ x, data = data)
  summary(model)
}
# scenario1()
# scenario2()
So basically, the data in scenario 1 is cleanly separated into 2 classes, while the data in scenario 2 comes from about 10 clusters and can't be cleanly separated by a straight line. Indeed, running the linear model on both scenarios shows that on average it fits scenario 1 better than scenario 2.
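One rough way to put a number on that comparison (an illustration only; it works because scenario1() and scenario2() return the lm summaries):
set.seed(1)
s1 <- scenario1()
s2 <- scenario2()
c(scenario1 = s1$r.squared, scenario2 = s2$r.squared)  # linear fit quality per scenario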

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create (in R) the equivalent of the following MATLAB function, which generates n samples from a mixture of N(m1, s1^2) and N(m2, s2^2), with a fraction alpha coming from the first Gaussian.
I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results occasionally give values of ±8, but the R version never even gives a value of ±5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha);
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
resulting plot
resulting hist
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
resulting plot
resulting hist
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim, mean_1, mean_2, std_1, std_2, alpha){
  U <- runif(nsim)
  I <- as.numeric(U < alpha)
  y <- I * rnorm(nsim, mean = mean_1, sd = std_1) +
    (1 - I) * rnorm(nsim, mean = mean_2, sd = std_2)
  return(y)
}
z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))
par(mfrow=c(2,3))
hist(z1_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of 95% of N(0,1) and 5% of N(0,36)",
col="blue",xlab=" ")
hist(z2_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 80% of N(0,1) and 20% of N(3,1)",
     col = "blue", xlab = " ")
hist(z3_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of samples of LN(0,1)",col="blue",xlab=" ")
##
plot(z1_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(0,36)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z2_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(3,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z3_standardized,type='l',
main="1000 samples from LN(0,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
There are two problems, I think ... (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37. (2) By setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of Gaussian with sd 1, rather than a Gaussian with sd 1 that is contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36.6) basically reduces it to a standard Gaussian with a slight bump near the origin ...
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
A more compact (and perhaps more idiomatic) version of your MATLAB gaussmix function (I think runif(n) < alpha is slightly more efficient than rbinom(n, size = 1, prob = alpha)):
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
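To mirror the question's MATLAB plots with this version (a small usage sketch; the standardization matches the P = (P - mean(P))/std(P) step):
s_std <- (s - mean(s)) / sd(s)  # normalize to mean 0, sd 1
par(mfrow = c(1, 2))
plot(s_std, type = "h", ylim = c(-15, 15))   # trace of the samples
hist(s_std, xlim = c(-15, 15), breaks = 50)  # histogram of the mixture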
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata <- sim(modelName = "V",
               parameters = list(pro = c(0.95, 0.05),
                                 mean = c(0, 0),
                                 variance = list(modelName = "V",
                                                 d = 1,
                                                 G = 2,
                                                 sigmasq = c(1, 36))),  # component variances 1 and 36
               n = 1000)
plot(scale(simdata[, 2]), type = "h")
I recently wrote density and sampling functions for a weighted mixture of normal distributions:
dmultiNorm <- function(x, means, sds, weights)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  weights <- weights / sum(weights)
  dens <- numeric(N)
  for (i in 1:n)
  {
    dens <- dens + weights[i] * dnorm(x, means[i], sds[i])
  }
  return(dens)
}
rmultiNorm <- function(N, means, sds, weights)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  Res <- numeric(N)
  for (i in 1:N)
  {
    s <- sample(1:n, 1, prob = weights)   # pick a component
    Res[i] <- rnorm(1, means[s], sds[s])  # draw from it
  }
  return(Res)
}
Here means is a vector of means, sds a vector of standard deviations, and weights a vector of probabilities (proportional) of sampling from each of the distributions. Is this useful to you?
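For the example in this thread, usage might look like this (a sketch; the 95/5 mix of N(0,1) and N(0,36) means standard deviations of 1 and 6):
x <- seq(-10, 10, by = 0.01)
plot(x, dmultiNorm(x, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05)),
     type = "l", ylab = "density")  # the mixture density
draws <- rmultiNorm(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
hist(draws, breaks = 50)            # samples from the same mixture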
Here is code to do this task:
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
plot(multG <- c(rnorm(950), rnorm(50, 0, 6))[sample(1000)], type = "h")  # N(0,36) has variance 36, so sd = 6
scmulG <- scale(multG)  # center to mean 0, scale to sd 1
summary(scmulG)
The summary confirms a mean of exactly zero, with the extreme values coming from the wide N(0,36) component.
