How can I simulate m random samples of size n from a given distribution with R? - r

I know how to generate a random sample of size n from a standard statistical distribution, say exponential. But if I want to generate m such random samples of size n (i.e. m vectors of dimension n) how can I do it?

To create a n by m matrix containing m samples of size n you can use:
x <- replicate(m, rnorm(n, ...))
Obviously substituting rnorm with other distributions if desired. If you then want to store these in separate individual vectors then you can use
v <- x[ , i]
This puts the ith column of x into v, which corresponds to the ith sample. It may be easier/quicker to just use a simple for loop altogether though:
for(i in 1:m){
name <- paste("V", i, sep = "")
assign(name, rnorm(n, ...))
}
This generates a random sample at each iteration, and for stage i, names the sample Vi. By the end of it you'll have m random samples named V1, V2, ..., Vm.

Related

R code Gaussian mixture -- numerical expression has 2 elements: only the first used

I'm trying to create a Gaussian Mix function according to these parameters:
For each sample, roll a die with k sides
If the j-th side appears from the roll, draw a sample from Normal(muj, sdj) where muj and sdj are the mean and standard deviation for the j-th Normal distribution respectively. This means you should have k different Normal distributions to choose from. Note that muj is the mathematical form of referring to the j-th element in a vector called mus.
The resulting sample from this Normal is then from a Gaussian Mixture.
Where:
n, an integer that represents the number of independent samples you want from this random variable
mus, a numeric vector with length k
sds, a numeric vector with length k
prob, a numeric vector with length k that indicates the probability of choosing the different Gaussians. This should have a default to NULL.
This is what I came up with so far:
n <- c(1)
mus <- c()
sds <- c()
prob <- c()
rgaussmix <- function(n, mus, sds, prob = NULL){
if(length(mus) != length(sds)){
stop("mus and sds have different lengths")
}
for(i in 1:seq_len(n)){
if(is.null(prob)){
rolls <- c(NA, n)
rolls <- sample(c(1:length(mus)), n, replace=TRUE)
avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
}else{
rolls <- c(NA, n)
rolls <- sample(c(1:length(mus), n, replace=TRUE, p=prob))
avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
}
}
return(avg)
}
rgaussmix(2, 1:3, 1:3)
It seems to match most of the requirements, but it keeps giving me the following error:
numerical expression has 2 elements: only the first usednumber of items to replace is not a multiple of replacement length
I've tried looking at the lengths of multiple variables, but I can't seem to figure out where the error is coming from!
Could someone please help me?
If you do seq_len(2) it gives you:
[1] 1 2
And you cannot do 1:(1:2) .. it doesn't make sense
Also you can avoid the loops in your code, by sampling the number of tries you need, for example if you do:
rnorm(3,c(0,10,20),1)
[1] -0.507961 8.568335 20.279245
It gives you 1st sample from the 1st mean, 2nd sample from 2nd mean and so on. So you can simplify your function to:
rgaussmix <- function(n, mus, sds, prob = NULL){
if(length(mus) != length(sds)){
stop("mus and sds have different lengths")
}
if(is.null(prob)){
prob = rep(1/length(mus),length(mus))
}
rolls <- sample(length(mus), n, replace=TRUE, p=prob)
avg <- rnorm(n, mean=mus[rolls], sd=sds[rolls])
avg
}
You can plot the results:
plot(density(rgaussmix(10000,c(0,5,10),c(1,1,1))),main="mixture of 0,5,10")

Computing Spearman's rho for increasing subsets of rows in for Loop

I am trying to fit a for Loop in R in order to run correlations for multiple subsets in a data frame and then store the results in a vector.
What I have in this loop is a data frame with 2 columns, x and y, and 30 rows of different continuous measurement values in each column. The process should be repeated 100 times. The data can be invented.
What I need, is to compute the Spearman's rho for the first five rows (between x and y) and then for increasing subsets (e.g., the sixth first rows, the sevenths first rows etc.). Then, I'd need to store the rho results in a vector that I can further use.
What I had in mind (but does not work):
sortvector <- 1:(30)
for (i in 1:100)
{
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
rho <- cor.test(xtemp,y, method="spearman")$estimate
}
The problem is that the code gives me one value of rho for the whole dataframe, but I need it for increments of subsets.
How can I get rho for subsets of increasing values in a for-loop? And how can i store the coefficients in a vector that i can use afterwards?
Any help would be much appreciated, thanks.
Cheers
The easiest approach is to convertfor loop into sapply function, which returns a vector of rho's as a result of your bootstrapping:
sortvector <- 1:(30)
x <- rnorm(30)
y <- rnorm(30)
rho <- sapply(1:100, function(i) {
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
cor.test(xtemp, y, method = "spearman")$estimate
})
head(rho)
Output:
rho rho rho rho rho rho
0.014460512 -0.239599555 0.003337041 -0.126585095 0.007341491 0.264516129

Matrix computation with for loop

I am newcomer to R, migrated from GAUSS because of the license verification issues.
I want to speed-up the following code which creates n×k matrix A. Given the n×1 vector x and vectors of parameters mu, sig (both of them k dimensional), A is created as A[i,j]=dnorm(x[i], mu[j], sigma[j]). Following code works ok for small numbers n=40, k=4, but slows down significantly when n is around 10^6 and k is about the same size as n^{1/3}.
I am doing simulation experiment to verify the bootstrap validity, so I need to repeatedly compute matrix A for #ofsimulation × #bootstrap times, and it becomes little time comsuming as I want to experiment with many different values of n,k. I vectorized the code as much as I could (thanks to vector argument of dnorm), but can I ask more speed up?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}
Your method can be put into a function like this
A.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
A <- matrix(NA,n,k)
for(j in 1:k) A[,j] <- dnorm(x,mu[j],sig[j])
A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will have the columns of the desired matrix stacked. and then the matrix to be returned can be formed from that vector just by specifying the number of rows an columns. See the following function
B.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
z <- dnorm(rep(x,times=k),rep(mu,each=n),rep(sig,each=n))
B <- matrix(z,nrow=n,ncol=k)
B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments B.fill is quite slow for large values of n.
The reason lies in the construct rep(...,each=...).
So is there a way to speed A.fill.
I tested this function:
C.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
sapply(1:k,function(j) dnorm(x,mu[j],sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.

Estimate a probability from elements of a list in R

I have a list of 100,000 simulated numbers of T in R (min: 1.5, max 88.8) and I want to calculate the probability of T being between 10 and 50.
I sumulated 100,000 numbers of T, where T is t(y) %*% M %*% y where M is a 8x8 matrix of constant values and y is a 8x1 matrix. The element in the i-th row if y, is equal to: a_i + b_i where a is a vector of constants and b is a vector whose elements follow a normal (0,sd=2) distribution (each element is a different simulated number of N(0,2) )
Is it in a vector or a list? If it's a vector, the following should work. If it's in a list, you may use unlist() to convert it to a vector.
mylist <- runif(100000,1.5,88.8) #this is just to generate a random number vector
length(which(mylist>=10 & mylist<=50))/length(mylist)
set.seed(42)
myrandoms <- rnorm(100000, mean=5, sd=2)
mydistr <- ecdf(myrandoms)
#probability of beeing between 1 and 3:
diff(mydistr(c(1, 3)))
#[1] 0.13781
#compare with normal distribution
diff(pnorm(c(1, 3), mean=5, sd=2))
#[1] 0.1359051
If you really have a list, use myrandoms <- do.call(c, mylist) to make it a vector.

compute samples variance without loops

Here is what I want to do:
I have a time series data frame with let us say 100 time-series of
length 600 - each in one column of the data frame.
I want to pick up 10 of the time-series randomly and then assign them
random weights that sum up to one. Using those I want to compute the
variance of the sum of the 10 weighted time series variables (e.g.
convex combination).
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
i can compute it inside a loop but r is vector oriented and it is not efficient.
ntrials = 10000
ts.sd = NULL
for (x in 1:ntrials))
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd = cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll use a normal distribution scaled s.t. it sums to one:
x=as.data.frame(matrix(sample(1:20, 100*600, replace=TRUE), ncol=100))
myfun <- function(inc, DF=x) {
w = runif(10)
w = w / sum(w)
t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
However, this isn't really avoiding loops per say since lapply is just an efficient looping construct. That said, for loops in R aren't explicitly bad or inefficient. Growing a data structure, like you're doing with cbind, however, is.
But in this case since you're only growing it by appending a single element it really wont change things much. The "correct" version would be to pre-allocate your vector ts.sd using ntrials.
ts.sd = vector(mode='numeric', length=ntrials)
The in your loop assign into it using i:
for (x in 1:ntrials))
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd[i] = temp
}

Resources