I am trying to generate data from a multinomial distribution in R using the function rmultinom, but I am having some problems.
Specifically, I want a data frame with 50 rows and 20 columns, and a total sum of the outcomes equal to 3 times n*p.
I am using this code:
p <- 20                   # intended number of columns (variables)
n <- 50                   # intended number of rows
N <- 3 * (n * p)          # total count: 3000
prob_true <- rep(1/p, p)  # equal probabilities
a <- rmultinom(50, N, prob_true)
But I get some very strange results, and the output is a matrix with 20 rows and 50 columns.
How can I solve this problem?
Thanks in advance!
The help available at ?rmultinom says that n in rmultinom(n, size, prob) is:
"number of random vectors to draw"
And size is:
"specifying the total number of objects that are put into K boxes in the typical multinomial experiment"
And the help says that the output is:
"For rmultinom(), an integer K x n matrix where each column is a random vector generated according to the desired multinomial law, and hence summing to size"
So you're asking for 50 vectors/variables with a total number of "objects" equal to 3000, so each column is drawn as a vector that sums to 3000.
colSums(a) does result in 3000.
Do you want your vectors/variables as rows? Then this would work just by transposing a:
t(a)
but if you want 20 columns, each of which is its own variable, you would need to switch your n and p (I also substituted n into the rmultinom call):
n <- 20
p <- 50
N <- 3*(n*p)
prob_true <- rep(1/p, p)
a <- rmultinom(n, N, prob_true)
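To confirm that this gives the requested shape, a quick check (added here as a sketch) is to look at the dimensions, verify the column sums, and convert the matrix to a data frame:
dim(a)                  # 50 20: 50 rows, 20 columns
all(colSums(a) == N)    # TRUE: each of the 20 variables sums to 3000
df <- as.data.frame(a)  # the 50 x 20 data frame the question asks for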
Related
I'm generating small samples (e.g. 24 obs) of normally distributed variable in R. It seems that the resulting variable has a systematically negative autocorrelation.
The code below generates 1000 samples of 24 observations of x and calculates the first three partial autocorrelations. These are not huge on average (-0.075 to -0.045), but the averages are always negative. Increasing the sample size (N) brings the autocorrelation closer to zero. However, my question is: why are the random numbers in a small sample negatively autocorrelated?
K <- 1000   # number of simulated samples
N <- 24     # observations per sample
ac <- NULL
for (k in 1:K) {
  x <- rnorm(n = N)
  ac <- rbind(ac, pacf(x, plot = FALSE)$acf[1:3, 1, 1])  # first three partial autocorrelations
}
apply(ac, 2, mean)
# [1] -0.04925651 -0.07523400 -0.04542514
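As a quick check of the observation above that the effect shrinks as the sample size grows, here is a small added sketch (not part of the original question) repeating the simulation for several values of N:
set.seed(1)
for (N in c(24, 100, 1000)) {
  ac <- replicate(1000, pacf(rnorm(N), plot = FALSE)$acf[1:3, 1, 1])
  cat("N =", N, "mean of first three partial autocorrelations:",
      round(rowMeans(ac), 4), "\n")
}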
I have generated an observed matrix, here is the code:
obs.matrix <- matrix(c(rep(1,10),rep(2,10)),nrow=10,ncol=2)
and now I want to get 3000 permuted datasets; in each dataset there should be ten 1s and ten 2s, but they can be in different columns.
I don't know how to do the rest. I have tried the following but failed:
x <- obs.matrix
theta <- function(resample) { sample(c(1, 2), replace = TRUE) }
result <- bootstrap::bootstrap(x, 3000, theta)
Thanks for help.
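One way to obtain such permuted datasets (a sketch added for illustration, assuming that permuting all 20 entries of the matrix is what is intended) is to shuffle the entries and reshape, so every dataset keeps exactly ten 1s and ten 2s:
set.seed(1)
perm.list <- replicate(3000,
                       matrix(sample(obs.matrix), nrow = 10, ncol = 2),
                       simplify = FALSE)
perm.list[[1]]  # first permuted dataset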
Does anyone know how to generate a matrix with a certain rank in R?
I ultimately want to create a data matrix Y = X + E, where rank(X) = k and the entries of E are i.i.d. N(0, sigma^2).
The easiest is the identity matrix, which always has full rank. So e.g. use:
k <- 10
mymatrix <- diag(k)
Here, the numbers of rows and columns are both equal to the rank you specify.
I suppose you want to mimic a regression model, so you might want more rows ('observations') than columns ('variables'). The following code allows you to specify both:
k <- 5 # rank of your matrix
nobs <- 10 # number of lines within X
X <- rbind(diag(k), matrix(0, nrow = nobs - k, ncol = k))  # nobs x k matrix of rank k
y <- X + matrix(rnorm(nobs * k), nrow = nobs)              # add i.i.d. N(0, 1) noise to every entry
Note that X, and therefore also y, now has full column rank; there is no multicollinearity in this 'model'.
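If you need rank exactly k without the special identity structure, a more general construction (an added sketch, not part of the original answer) is to multiply two random matrices: the product of an nobs x k and a k x p matrix with i.i.d. normal entries has rank k with probability 1.
set.seed(1)
nobs <- 10; p <- 8; k <- 5; sigma <- 1
X <- matrix(rnorm(nobs * k), nobs, k) %*% matrix(rnorm(k * p), k, p)
qr(X)$rank                                              # k = 5
Y <- X + matrix(rnorm(nobs * p, sd = sigma), nobs, p)   # Y = X + E with E ~ i.i.d. N(0, sigma^2)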
I have a question about random sampling.
Are the two following results (A and B) statistically the same?
nobs <- 1000
A <- rt(n=nobs, df=3, ncp=0)
simulations <- 50
B <- unlist(lapply(rep.int(nobs/simulations, times = simulations),
                   function(y) rt(n = y, df = 3, ncp = 0)))
I thought they would be, but now I've been going back and forth.
Any help would be appreciated.
Thanks
With some small changes, you can even make them numerically equal. You only need to seed the RNG and omit the ncp parameter, using its default value (of 0) instead:
nobs <- 1000
set.seed(42)
A <- rt(n=nobs, df=3)
simulations <- 50
set.seed(42)
B <- unlist(lapply(rep.int(nobs/simulations, times = simulations),
                   function(y) rt(n = y, df = 3)))
all.equal(A, B)
#[1] TRUE
Why don't you get equal results when you specify ncp=0?
Because then rt assumes that you actually want a non-central t-distribution, and the values are calculated as rnorm(n, ncp)/sqrt(rchisq(n, df)/df). That means that when you create 1000 values at once, rnorm is called once and rchisq is called once afterwards. If you instead create 20 values 50 times, the calls to these RNGs alternate, so the RNG state at the rnorm and rchisq calls differs from the first case.
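To illustrate this, here is a small added sketch: with ncp = 0 given explicitly, the same seed no longer reproduces identical values, because the rnorm and rchisq calls are interleaved differently.
set.seed(42)
A2 <- rt(n = nobs, df = 3, ncp = 0)
set.seed(42)
B2 <- unlist(lapply(rep.int(nobs/simulations, times = simulations),
                    function(y) rt(n = y, df = 3, ncp = 0)))
isTRUE(all.equal(A2, B2))
#[1] FALSE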
I have the following four number sets:
A=[1,207];
B=[208,386];
C=[387,486];
D=[487,586].
I need to generate 20000 random numbers between 1 and 586 such that the probability that the generated number belongs to A is 1/2, and to each of B, C and D is 1/6.
How can I do this using R?
You can directly use sample, more specifically its prob argument. Just divide the probability over all 586 numbers: each number in A gets a weight of 0.5/207, etc.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
L <- sapply(list(A, B, C, D), length)
x <- sample(c(A, B, C, D),
            size = 20000,
            prob = rep(c(1/2, 1/6, 1/6, 1/6) / L, L),
            replace = TRUE)
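As a quick sanity check (an added sketch), the empirical proportions should come out close to the requested 1/2, 1/6, 1/6, 1/6:
sapply(list(A, B, C, D), function(s) mean(x %in% s))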
I would say use the Roulette selection method. I will try to give a brief explanation here.
Take a line of, say, length 1 unit. Now break it into pieces in proportion to the probability values. So in our case, the first piece will have length 1/2 and the next three pieces will each have length 1/6. Now sample a number between 0 and 1 from a uniform distribution. Because every point is equally likely, the probability that the sampled number falls in a given piece equals the length of that piece. Hence, whichever piece the number falls in, sample from the corresponding vector. (The R code is below; you can run it for a large number of draws to check that this is true. I might not be doing a good job of explaining it here.)
It is called roulette selection because of another analogy for the same situation: take a circle and split it into sectors, where the angle of each sector is proportional to the probability values. Then sample a number from a uniform distribution, see which sector it falls in, and sample from the corresponding vector.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
cumList <- list(A, B, C, D)
probVec <- c(1/2, 1/6, 1/6, 1/6)
cumProbVec <- cumsum(probVec)   # cumulative "break points" along the unit line
ret <- NULL
for (i in 1:20000) {
  rand <- runif(1)                               # uniform draw on [0, 1]
  whichVec <- which(rand < cumProbVec)[1]        # first piece whose upper bound exceeds the draw
  ret <- c(ret, sample(cumList[[whichVec]], 1))  # draw uniformly from that set
}
#Testing the results
length(which(ret %in% A)) # Almost 1/2*20000 of the values
length(which(ret %in% B)) # Almost 1/6*20000 of the values
length(which(ret %in% C)) # Almost 1/6*20000 of the values
length(which(ret %in% D)) # Almost 1/6*20000 of the values
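The same idea can also be vectorized (an added sketch, not part of the original answer): first draw which set each of the 20000 numbers comes from, then sample within the chosen set. This avoids growing ret with c() inside a loop.
set.seed(1)
whichSet <- sample(seq_along(cumList), size = 20000, replace = TRUE, prob = probVec)
ret2 <- sapply(cumList[whichSet], sample, size = 1)
mean(ret2 %in% A)  # should be close to 1/2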