Simulating t-distributions: random samples in R

I am new to simulation exercises in R. I want to create 1000 samples of size 25 from a t distribution with degrees of freedom 10.
Do I need to create a single vector of data from the rt generator, and then sample repeatedly from that? So, for example, I could create the vector:
singlevector <- rt(5000, 10), which generates a sample of size 5000 from a t-distribution with df = 10. I would then treat this as my population and sample from it. I chose the population size of 5000 arbitrarily here.
OR, should I create my 1000 samples by calling this random t generator each time?
In other words, create a matrix with 25 rows and 1000 columns, each column containing a vector from a new call of rt(25, 10).

Since you are sampling independent, identically distributed values, all three of the following approaches are statistically equivalent:
1. call the random number generator once to get as many (or more) values than you need, then sample that vector without replacement;
2. call the random number generator 1000 times, picking 25 values each time;
3. call the random number generator once, picking 25000 values, then subdivide the vector into individual samples in order (rather than randomly).
The latter two are not just statistically but computationally equivalent. In the first approach, the order of samples gets scrambled, but that makes no difference to the statistical properties.
Approach #1:
set.seed(101)
x1 <- rt(25000, 10)
## assign each value a random group label in 0..999 (each label occurs
## exactly 25 times), then cbind the groups into a 25 x 1000 matrix
r1 <- do.call(cbind, split(x1, sample(0:24999) %/% 25))
Illustrating the equivalence of #2 and #3:
set.seed(101)
r2 <- replicate(1000, rt(25, 10))
set.seed(101)
r3 <- matrix(rt(25000,10),nrow=25)
identical(r2,r3) ## TRUE
In general, solution #3 is fastest, but all of these approaches are very fast for problems of this order of magnitude (approx. 5 milliseconds for #3 vs. 10 milliseconds for #2 with 25 x 1000 samples on my laptop); I would pick whichever approach is easiest for you to understand when you read the code.
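If you want to check the relative timings on your own machine, here is a minimal sketch using the microbenchmark package (my choice of timing tool; base system.time would also do):
library(microbenchmark)
microbenchmark(
  approach1 = do.call(cbind, split(rt(25000, 10), sample(0:24999) %/% 25)),
  approach2 = replicate(1000, rt(25, 10)),
  approach3 = matrix(rt(25000, 10), nrow = 25)
)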

Related

Simulation to find random sequences

With R I can try to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which resulted in p-value = 0.2892. Other colleagues used the rle function (run length encoding in R) or other approaches to simulate the probability that random allocation would generate the observed sequence. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings; any help on how to set up such a simulation is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
randtests::runs.test(Age)
X <- rle(Age); X$lengths
What was initially presented isn't the whole story: if one looks at the supplement where these numbers come from, the reported p-value is for comparing two vectors. The OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even just checking positionwise identity (the two original vectors agree in 10 positions) under permutations within one group, I see only 2 or 3 draws per million with that many or more identical values. I.e., something like:
set.seed(123)
## fraction of 1e6 within-group permutations with at least as many
## positionwise matches as the 10 observed
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).
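For instance, a minimal sketch of a permutation check on the positionwise correlation (my guess at one such test, not necessarily the article's exact procedure):
set.seed(123)
obs <- cor(group1, group2)                           ## observed positionwise correlation
perm <- replicate(1e5, cor(sample(group1), group2))  ## correlations after permuting one group
mean(abs(perm) >= abs(obs))                          ## two-sided permutation p-value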

Randomly generated numbers are linear combinations of each other even if not specified

I am simulating some draws using random numbers. Unfortunately, the generated numbers are not as random as I would like: in fact, I find that there are some linear combinations among them.
In detail, I have the following starting data:
start_vector = c(1,10,30,40,50,100) # length equal to 6
residual_of_model = 5
n = 1000 # Number of simulations
I try to simulate n observations from a normal distribution for each of the start_vector elements, treating the draws as "random noise" to add to the original value (the one in start_vector):
out_vec <- matrix(NA, nrow = n, ncol = length(start_vector))
for (h_aux in seq_along(start_vector)) {
  ## add n draws of normal noise (sd = residual_of_model) to each starting value
  random_noise <- rnorm(n, 0, residual_of_model)
  out_vec[, h_aux] <- as.numeric(start_vector[h_aux]) + random_noise
}
At this point, I obtain a matrix of size 1000x6. In theory, I assume all the columns and all the rows of the matrix are linearly independent of one another.
If I check this using the findLinearCombos() function from the caret package, I obtain that all the columns are independent:
caret::findLinearCombos(out_vec)
If I try to evaluate the independence among the rows, using the following code:
caret::findLinearCombos(t(out_vec))
I obtain that all the rows from 7 to 1000 are linear combinations of the first 6 (the length of start_vector).
This seems really strange to me; I would expect to observe no dependencies at all, since the rows are generated by adding random noise using rnorm.
What am I missing? Is there a bug somewhere? Thanks in advance!

Simulating averages of normally-distributed random variables in R

I'm trying to simulate some data in R to check my manual calculations of how variance changes in a simple model that involves a sequence of normally-distributed random variables being averaged. However, I find I'm getting results that are not only inconsistent with my manual calculations, but inconsistent with each other. Clearly I'm doing something wrong, but I'm having trouble isolating the problem(s).
Conceptually, the model involves two steps: First, storing a variable, and second, using the stored variable(s) to produce an output. The output is then stored as a new variable, contributing to future outputs, and so on. I assume storage is noisy (i.e., what's stored is a random variable rather than a constant), but that no further noise is added in output production, which simply involves averaging the existing stored variables. Thus, my model involves the following steps, where V_i is the variable stored at step i, and O_i is the ith output:
V_1 ~ Normal(0, 1)
O_1 = V_1
V_2 ~ Normal(O_1, 1)
O_2 = (V_1 + V_2) / 2
V_3 ~ Normal(O_2, 1)
O_3 = (V_1 + V_2 + V_3) / 3
and so on.
I've tried simulating this in R in two ways: First,
nSamples <- 100000
o1 <- rnorm(nSamples) # First output
o2 <- rowMeans(cbind(rnorm(nSamples, mean=o1),rnorm(nSamples))) # Second output, averaged from first two stored variables.
o3 <- rowMeans(cbind(rnorm(nSamples, mean=o2),rnorm(nSamples, mean=o1),rnorm(nSamples))) # Third output, averaged from first three stored variables.
This gives me
var(o1) # Approximately 1, as per my manual calculations.
var(o2) # Approximately .75, as per my manual calculations.
var(o3) # Approximately .64, where my manual calculations give 19/36 or approximately .528.
Initially, I trusted the code and assumed my calculations were wrong. Then, I tried the following alternative code, which more explicitly follows the steps I used manually:
nSamples <- 100000
initialValue <- 0
v1 <- rnorm(nSamples, initialValue)
o1 <- v1
v2 <- rnorm(nSamples, o1)
o2 <- rowMeans(cbind(v1, v2))
v3 <- rnorm(nSamples, o2)
o3 <- rowMeans(cbind(v1, v2, v3))
This gives me
var(o1) # Approximately 1, as per my calculations.
var(o2) # Approximately 1.25, where my calculations give .75.
var(o3) # Approximately 1.36, where my calculations give approximately .528.
Thus, clearly I have done something wrong in using at least two of these three methods, but I'm having trouble isolating the source of the problems. What is it I'm missing that is leading my code to behave differently than what I'm expecting? And what is the difference between the two code examples that leads the variance to decrease in one and increase in the other?
Your correct calculation is the first one, where you generate new realizations of the normal random variables when averaging, as opposed to reusing the realizations generated in the previous step.
In fact, your derivation of the distribution of O2 assumes that the two normal random variables being averaged are mutually independent.
In your second calculation this is not true: you are averaging v1 and v2, which are not independent, since both depend on o1. This is why you get larger variances in the second case.
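To make this concrete, work out Var(O2) both ways, using Var(V1) = 1, Var(V2) = Var(O1) + 1 = 2, and Cov(V1, V2) = Var(V1) = 1:
With fresh, independent realizations (your first code): Var(O2) = (1 + 2) / 4 = 0.75.
With the reused, dependent realizations (your second code): Var(O2) = (1 + 2 + 2*1) / 4 = 1.25.
Both match the simulated variances you report.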

sample integer values with specified mean

I want to generate a sample of integer numbers in R with a specified mean.
I used mu + sd*scale(rnorm(n)) to generate a sample of n values that has exactly mean = mu,
but this generates floating-point values; I would like to generate integer values instead. For example, I would like to generate a sample with mean = 4: with sample size n = 5, an example of generated values would be {2, 6, 4, 3, 5}.
Any ideas on how to do this in R while satisfying the constraint of a specific value of the mean?
Picking n values with a mean of m is equivalent to picking n values that sum to m*n. (I'm assuming you're going to stick to positive integers -- otherwise things get much harder!) Here is a solution based on sampling partitions (sets of values that add up to the desired total) uniformly, but I'm not sure it's what you want, since it doesn't sample uniformly over values, but over partitions ... perhaps someone else can do better, or figure out how to reweight the samples.
This brute-force solution will also probably fail for cases much larger than your example (there are 627 partitions for a total of 20, 5604 for a total of 30, 37338 for a total of 40 ...)
m <- 4
n <- 5
library("partitions")
pp <- parts(m*n) ## all sets of integers that sum to m*n (=20 here)
## restrict to partitions with exactly n (=5) non-zero values.
pp5 <- pp[1:5,colSums(pp>0)==n]
set.seed(101) ## for reproducibility
## sample uniformly from this set
pp5[,sample(ncol(pp5),size=1)] ## 9, 5, 4, 1, 1
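As a quick sanity check (my addition), every retained partition should have exactly the requested mean:
all(colMeans(pp5) == m)  ## TRUE: each column sums to m*n across n values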

Fisher's exact test (R) - simulated p-value does not vary

I have a problem using Fisher's exact test in R with a simulated p-value, but I don't know whether it's caused by the technique (R) or whether it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
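(For example, with fisher.test's default of B = 2000 replicates, this comes out to p = 1/2001, about 0.0005, every time.)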
When I shorten the matrix, it works if the number of rows is lower than 19. Nevertheless, it is not a matter of the number of cells in the matrix: after transforming it into a matrix with 3 columns it still does not work, although it does when the same numbers are in just two columns.
Varying simulated p-values:
a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
fisher.test(a,simulate.p.value=TRUE)$p.value
The number of cells in a and c is the same, but the simulation only works with matrix a.
Does anyone know whether it is a statistical issue or an R issue and, if so, how it could be solved?
Thanks for your suggestions.
I think that you are just seeing a very significant result. The p-value is computed as the number of matrices that are as extreme as or more extreme than the original (counting the original itself), divided by the total number of matrices. If none of the randomly generated matrices is as extreme or more extreme, then the p-value will just be 1 (the original matrix is as extreme as itself) divided by the total number of matrices, which is $B+1$ (the B simulated matrices plus the 1 original matrix). If you run the function with enough samples (a high enough B), then you will start to see some of the random matrices come out as extreme or more extreme, and therefore varying p-values, but the time needed to do so is probably not reasonable.
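For illustration, a sketch of raising B (the B argument of fisher.test; the default is 2000). For a table this extreme, even a much larger B may still return 1/(B + 1), and the runtime grows roughly linearly with B:
fisher.test(b, simulate.p.value = TRUE, B = 1e5)$p.value
## likely still 1/(1e5 + 1), i.e. ~1e-05, until B gets large enough
## for some simulated tables to be as extreme as the original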
