Generate many sample pairs from normal distribution - r

I am trying to learn how to use R for statistics, and I would like to know how I can generate 20 000 pairs of samples (K pairs), where each sample has 50 points drawn from the same normal distribution (mean 2.5 and variance 9).
So far I know that this is how I make 50 points from a normal distribution:
rnorm(50,2.5,3)
But how do I generate 20 000 times a set of two samples so I can perform tests on the K pairs later?

x <- lapply(c(1:20000),
function(x){
lapply(c(1:2), function(y) rnorm(50,2.5,3))
})
This produces 20000 paired samples, where each sample consists of 50 observations from a N(2.5, 3^2) distribution. Note that x is a list in which each slot is itself a list of two vectors of length 50.
To t-test the samples, you'll need to extract the vectors and pass them to the function t.test.
t.tests <- lapply(x, function(y) t.test(x=y[[2]], y=y[[1]]))
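Once the tests are stored in a list, individual results can be pulled out with sapply. The sketch below assumes the p-values are what you want and uses a smaller K (an illustrative value, not the 20000 from the question) so that it runs quickly:

```r
set.seed(1)                      # only so the sketch is reproducible
K <- 200                         # illustrative; the question uses 20000

# Same nested-list structure as above: K slots, each a list of two samples
x <- lapply(seq_len(K), function(i) {
  lapply(1:2, function(j) rnorm(50, mean = 2.5, sd = 3))
})
t.tests <- lapply(x, function(y) t.test(x = y[[2]], y = y[[1]]))

# Extract one p-value per pair into a plain numeric vector
p.values <- sapply(t.tests, function(tt) tt$p.value)

# Both samples come from the same distribution, so the share of
# p-values below 0.05 should be somewhere near 5%
mean(p.values < 0.05)
```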

Something along the lines of
yourresults <- replicate(20000,{yourtest(matrix(rnorm(100,2.5,3),ncol=2),<...>)})
or
yourresults <- replicate(20000,{yourtest(rnorm(50,2.5,3),rnorm(50,2.5,3),<...>)})
where yourtest is whatever your function is that's carrying out some test, and <...> is whatever other arguments you pass to yourtest. The first one is suitable if it expects a matrix with two columns, the second is suitable if it expects two vectors. You can adapt this approach to other forms of input - such as a formula interface - in the obvious way.
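As a concrete illustration of the second form, here is a minimal sketch in which yourtest is a hypothetical wrapper around t.test that returns the statistic and the p-value. The wrapper, the seed and the reduced number of replications are assumptions made for the example:

```r
set.seed(2)                      # only so the sketch is reproducible
K <- 200                         # illustrative; the question uses 20000

# Hypothetical stand-in for yourtest: takes two vectors and returns
# a named numeric vector of length 2
yourtest <- function(a, b) {
  tt <- t.test(a, b)
  c(statistic = unname(tt$statistic), p.value = tt$p.value)
}

yourresults <- replicate(K, yourtest(rnorm(50, 2.5, 3), rnorm(50, 2.5, 3)))

# replicate() simplifies equal-length vectors into a matrix:
# one row per returned value, one column per replication
dim(yourresults)
```

Because each call returns a vector of the same length, replicate gives back a matrix, so yourresults["p.value", ] holds all the p-values.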

Related

functions for matrices / cluster analysis

I have the following problem. Maybe you can help me!
I have 60 matrices (60 trials). Each of those matrices has 16*1000 fields (16 angles and 1000 timestamps). The 16 angles are body angles.
Now I want to calculate the Euclidean distance for each of the pairwise combinations (1770). So I would get 1770 matrices, each with 16*1000 fields.
I would get the list of all combinations through this call:
>comb <- combinations(n=60, r=2, v=n, set=TRUE, repeats.allowed=FALSE)
The formula which I want to apply to each of these combinations is:
> dab <- sqrt(sum((a-b)^2)) # a and b are two matrices
I tried to create a function, which unfortunately works only for single values, not for whole matrices:
>dist.fun <- function(x,y)
>{
>z <- sqrt(sum((x)-(y))^2)
> return(z)
>}
Out of those distances I want to create a Euclidean distance matrix to do a cluster analysis.
>plot(hclust((as.dist(m)),method="ward.D2")) # m is the Euclidean distance matrix
I hope someone can help me with this problem. The data is biomechanical data from gymnasts, which I want to investigate in terms of variant and invariant components and prototypes.
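One possible way to build such a distance matrix is sketched below. It assumes the trial matrices are stored in a list (here called trials, a hypothetical name filled with random numbers), uses base R's combn instead of combinations, and reduces the number of trials from 60 to 6. Note that the formula sqrt(sum((a-b)^2)) yields a single number per pair of matrices, not a 16*1000 matrix:

```r
set.seed(1)
n_trials <- 6                    # illustrative; the question has 60 trials

# Hypothetical stand-in data: one 16 x 1000 matrix per trial
trials <- replicate(n_trials, matrix(rnorm(16 * 1000), nrow = 16),
                    simplify = FALSE)

# All pairs of trial indices, one pair per column
pairs <- combn(n_trials, 2)

# Euclidean distance between two whole matrices: one number per pair
d <- apply(pairs, 2, function(p) {
  sqrt(sum((trials[[p[1]]] - trials[[p[2]]])^2))
})

# Fill a symmetric distance matrix from the pairwise distances
m <- matrix(0, n_trials, n_trials)
m[t(pairs)] <- d                 # upper triangle, addressed by (i, j)
m <- m + t(m)                    # mirror into the lower triangle

hc <- hclust(as.dist(m), method = "ward.D2")
```

The same distances can also be obtained by flattening each matrix into one row and calling dist(), which may be more convenient for 60 trials.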

Most efficient way to randomize a matrix in R or in Python

I'm working with a numeric matrix M in R which is quite big (11000 rows by 20 columns). On this matrix, I'm performing a lot of correlation tests
=> the function cor.test(M[i,], M[j,], method='spearman') where i and j are two rows from the matrix (all possible combinations are tested).
The problem, as you know, is that I'm doing too many tests for the p-values returned by this test to be very reliable.
My strategy to overcome this limitation would be to generate a new probability distribution by bootstrapping my matrix M: I would like to generate 100 random matrices from M, run the multiple correlations on these matrices, and choose the right cut-off for the p-value to get an FDR of 5%.
My question is:
What is the most efficient way to randomize my matrix?
Since it's quite time-consuming (I suppose), it would be interesting if the solution could be parallelized.
Thank you in advance for all the useful answers that you'll provide.
In Python there is a function random.sample() in the module random. If you store M as a list of rows, randomly sampling n rows from M without replacement looks like this:
import random
M_sample = random.sample(M, n)
However, for bootstrapping you probably want random sampling with replacement. numpy.random.choice() can do this, but it only accepts a 1-dimensional array, so draw row indices with it and then index M:
import numpy
indices = numpy.random.choice(len(M), n, replace=True)
M_sample = numpy.asarray(M)[indices]
In R, we use sample() to randomly decide the row indices to take, and then use row access to take the rows from the matrices. Randomly sampling n rows from matrix M without replacement is done as follows:
indices = sample(nrow(M), n,replace=FALSE)
M_sample = M[indices, ]
And for randomly sampling with replacement, replace the first line with this:
indices = sample(nrow(M), n,replace=TRUE)
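To get the 100 resampled matrices mentioned in the question, the row sampling can be wrapped in replicate. This is a minimal sketch on a small stand-in matrix (the dimensions and the name n_boot are assumptions for the example). It only shows the mechanics of resampling rows; whether that is the appropriate null distribution for the correlation tests is a separate statistical question:

```r
set.seed(1)
M <- matrix(rnorm(20 * 5), nrow = 20)   # toy stand-in for the 11000 x 20 matrix
n_boot <- 100                           # number of resampled matrices

# Each element is M with its rows resampled with replacement;
# simplify = FALSE keeps the results as a list of matrices
boot <- replicate(n_boot,
                  M[sample(nrow(M), replace = TRUE), , drop = FALSE],
                  simplify = FALSE)

length(boot)      # number of resampled matrices
dim(boot[[1]])    # same dimensions as M
```

Since the resamples are independent, the list could later be processed with lapply or a parallel equivalent.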

Multiple Matrix Operations in R with loop based on matrix name

I'm a novice R user, who's learning to use this coding language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I managed to get the following:
# kn.matrices<-with(patents, table(Class,year,firm))
# kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
# for (k in 1:37)
# kn.firms = assign(paste("firm", k ,sep=''),kn.matrices[,,k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk with k going from 1 until 37.
I would like to perform many operations in each of the firm-specific matrices (e.g. compare the numbers in app.yr 't' with the average of the 3 previous years across all rows) so I am looking for a way that allows me to loop the operations for every matrix named firm1,firm2,firm3...,firm37 and that generates new matrices with consistent naming, e.g. firm1.3yearcomparison
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following comments I'm trying to add a minimal reproducible example
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.
It looks like you already have a 3-dimensional array with all your data. You can view this as your firm-specific matrices piled one on top of the other. You don't need to split it into separate matrices and loop over them. Instead, you can use R's apply function and array indexing. The help topic for the apply() family should show you how to do what you want. Here are a few basic examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])
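The three-year comparison from the question can be expressed in the same style. The sketch below works on a small hypothetical array (counts, with made-up Poisson data and dimensions class x year x firm) and takes "comparison" to mean the difference between year t and the mean of the three preceding years, which is an assumption about what is wanted:

```r
set.seed(1)
# Hypothetical toy array: 4 classes x 6 years x 2 firms of patent counts
counts <- array(rpois(4 * 6 * 2, lambda = 3), dim = c(4, 6, 2),
                dimnames = list(Class = paste0("c", 1:4),
                                year  = as.character(1990:1995),
                                firm  = c("a", "b")))

n_years <- dim(counts)[2]

# For each year t from the 4th onwards: counts in year t minus the
# mean of years t-3 .. t-1, computed for every class and firm at once
comparison <- sapply(4:n_years, function(t) {
  previous_mean <- apply(counts[, (t - 3):(t - 1), ], c(1, 3), mean)
  counts[, t, ] - previous_mean
}, simplify = "array")

dim(comparison)   # class x firm x (years 4..n_years)
```

All firms are handled in one array, so there is no need for separately named objects such as firm1.3yearcomparison; comparison[, "a", ] would hold the result for firm a.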

R looping over two vectors

I have created two vectors in R, using statistical distributions to build the vectors.
The first is a vector of locations on a string of length 1000. That vector has around 10 values and is called mu.
The second vector is a list of numbers, each one representing the number of features at each location mentioned above. This vector is called N.
What I need to do is generate a random distribution for all features (N) at each location (mu)
After some fiddling around, I found that this code works correctly:
for (i in 1:length(mu)){
a <- rnorm(N[i],mu[i],20)
feature.location <- c(feature.location,a)
}
This produces the right output - a list of numbers of length sum(N), and each number is a location figure which correlates with the data in mu.
I found that this only worked when I used concatenate to get the values into a vector.
My question is: why does this code work? How does R know to loop sum(N) times but for each position in mu? What role does concatenate play here?
Thanks in advance.
To try and answer your question directly, c(...) is not "concatenate", it's "combine". That is, it combines its argument list into a vector. So c(1,2,3) is a vector with 3 elements.
Also, rnorm(n,mu,sigma) is a function that returns a vector of n random numbers sampled from the normal distribution. So at each iteration, i,
a <- rnorm(N[i],mu[i],20)
creates a vector a containing N[i] random numbers sampled from Normal(mu[i],20). Then
feature.location <- c(feature.location,a)
appends the elements of that vector to the vector from the previous iteration. So at the end, you have a vector with sum(N) elements.
I guess you're sampling from a series of locations, each a variable no. of times.
I'm guessing your data looks something like this:
set.seed(1) # make reproducible
N <- ceiling(10*runif(10))
mu <- sample(seq(1000), 10)
> N;mu
[1] 3 4 6 10 3 9 10 7 7 1
[1] 206 177 686 383 767 496 714 985 377 771
Now you want to take a sample from rnorm of length N[i], with mean mu[i] and sd=20, and store all the results in a vector.
The method you're using (growing the vector) is not recommended, as the vector is re-copied in memory each time elements are added. (See Circle 2 of The R Inferno; for small examples like this, though, it's not so important.)
First, initialize the storage vector:
f.l <- NULL
for (i in 1:length(mu)){
a <- rnorm(n=N[i], mean=mu[i], sd=20)
f.l <- c(f.l, a)
}
Then, each time, a stores your sample of length N[i] and c() combines it with the existing f.l by adding it to the end.
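A more efficient approach is
unlist(mapply(rnorm, N, mu, MoreArgs=list(sd=20), SIMPLIFY=FALSE))
This replaces the explicit loop. unlist is needed because mapply returns a list of vectors of varying lengths; SIMPLIFY=FALSE makes sure a list is returned even if all elements of N happen to be equal.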
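Because rnorm itself accepts a vector of means, the same result can also be obtained in a single call by repeating each mean N[i] times. This is a minimal sketch with illustrative values for N and mu (not the ones printed above); the seed is reset before each version only to make the two comparable:

```r
N  <- c(3, 4, 6)          # illustrative number of features per location
mu <- c(206, 177, 686)    # illustrative location means

# Loop version, as in the question
set.seed(10)
f.l <- NULL
for (i in seq_along(mu)) {
  f.l <- c(f.l, rnorm(n = N[i], mean = mu[i], sd = 20))
}

# Single call: rep() expands mu so that each draw gets its own mean
set.seed(10)
f.v <- rnorm(sum(N), mean = rep(mu, times = N), sd = 20)

length(f.v) == sum(N)
```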

How should I combine two loops in r?

I want to ask your opinion since I am not so sure how to do it. This is regarding one part of my paper project and my situation is:
Stage I
I have 2 groups and for each group I need to compute the following steps:
1. Generate 3 random numbers from a normal distribution and square them.
2. Repeat step 1 15 times, so that at the end I have 15 random numbers.
I have already done stage I using a for loop.
n1<-3
n2<-3
miu<-0
sd1<-1
sd2<-1
asim<-15
w<-rep(NA,asim)
x<-rep(NA,asim)
for (i in 1:asim) {
print(i)
set.seed(i)
data1<-rnorm(n1,miu,sd1)
data2<-rnorm(n2,miu,sd2)
w[i]<-sum(data1^2)
x[i]<-sum(data2^2)
}
w
x
The second stage is:
Stage II
For each group, I need to:
1. Sort the group;
2. Find the trimmed mean for each group.
I need to simulate the whole process (stage I and stage II) 5000 times. How should I proceed with step 2? Do you think I need another loop for stage II?
You can do these tasks without explicit loops. Note a few things first: generating 3 × 15 × 2000 random numbers in small batches or all at once makes no difference, because they share the same distribution either way. (I use 2000 repetitions below; set repetitions to 5000 for the full simulation you describe.)
Next: setting the seed inside the loop makes your simulation deterministic, because the generator is reset on every iteration and every repetition of the whole process would reuse the same numbers. Call set.seed once at the start of your script.
So, what we will do is generate all random numbers at once, then compute the squared norms for groups of three, then build groups of 15.
First some variable definitions:
set.seed(20131301)
repetitions <- 2000
numperval <- 3
numpergroup <- 15
miu <- 0
sd1 <- 1
sd2 <- 1
As we need two groups, we wrap the group generation into a custom function. This is not necessary, but it helps keep the code clean and readable.
generateGroup <- function(repetitions, numperval, numpergroup, m, s) {
  # Generate all data
  data <- rnorm(repetitions*numperval*numpergroup, m, s)
  # Build groups of 3:
  data <- matrix(data, ncol=numperval)
  # And generate the squared norm of those
  data <- rowSums(data*data)
  # Finally build a matrix with 15 columns, each column one dataset of numbers, each row one repetition
  matrix(data, ncol=numpergroup)
}
Great, now we can generate random numbers for our group:
group1 <- generateGroup(repetitions, numperval, numpergroup, miu, sd1)
group2 <- generateGroup(repetitions, numperval, numpergroup, miu, sd2)
To compute the trimmed mean, we use apply to call mean on each row:
trimmedmeans_group1 <- apply(group1, 1, mean, trim=0.25)
trimmedmeans_group2 <- apply(group2, 1, mean, trim=0.25)
I used mean with the trim argument instead of sorting, throwing away and computing the mean. If you need the sorted numbers explicitly, you could do it by hand (just for one group, this time):
sorted <- t(apply(group1, 1, sort))
# We have to transpose as apply by default returns a matrix with each observation in one column. I chose the other way around above, so we stick with this convention and transpose.
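Now it would be easy to throw away the lowest and highest columns and take the row means, if you want to do it manually. To match trim=0.25 on 15 values you would drop three columns at each end, since mean discards floor(15 × 0.25) = 3 values per side.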
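The manual route can be checked against mean(..., trim=0.25) on a small example. This sketch uses a toy matrix in place of group1 (the data and the names group, k, manual and builtin are assumptions for illustration) and relies on mean dropping floor(n × trim) values from each end:

```r
set.seed(42)
# Toy stand-in for group1: 5 repetitions (rows) of 15 values (columns)
group <- matrix(rchisq(5 * 15, df = 3), ncol = 15)

# Sort every row; transpose so that rows are still repetitions
sorted <- t(apply(group, 1, sort))

# Number of values to drop at each end for trim = 0.25
k <- floor(ncol(sorted) * 0.25)

# Manual trimmed mean: average the columns that remain
manual <- rowMeans(sorted[, (k + 1):(ncol(sorted) - k)])

# Built-in trimmed mean for comparison
builtin <- apply(group, 1, mean, trim = 0.25)

all.equal(manual, builtin)
```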
