R looping over two vectors - r

I have created two vectors in R, using statistical distributions to build the vectors.
The first is a vector of locations on a string of length 1000. That vector has around 10 values and is called mu.
The second vector is a list of numbers, each one representing the number of features at each location mentioned above. This vector is called N.
What I need to do is generate a random distribution of locations for all the features (N) at each location (mu).
After some fiddling around, I found that this code works correctly:
for (i in 1:length(mu)){
a <- rnorm(N[i],mu[i],20)
feature.location <- c(feature.location,a)
}
This produces the right output - a list of numbers of length sum(N), and each number is a location figure which correlates with the data in mu.
I found that this only worked when I used concatenate to get the values into a vector.
My question is: why does this code work? How does R know to loop sum(N) times but for each position in mu? What role does concatenate play here?
Thanks in advance.

To try and answer your question directly, c(...) is not "concatenate", it's "combine". That is, it combines its argument list into a vector. So c(1,2,3) is a vector with 3 elements.
Also, rnorm(n,mu,sigma) is a function that returns a vector of n random numbers sampled from the normal distribution. So at each iteration, i,
a <- rnorm(N[i],mu[i],20)
creates a vector a containing N[i] random numbers sampled from Normal(mu[i],20). Then
feature.location <- c(feature.location,a)
appends the elements of that vector to the vector built up in the previous iterations. The loop itself runs only length(mu) times; each pass adds N[i] elements, so at the end you have a vector with sum(N) elements.
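As a rough illustration of the explanation above, the sketch below records the length of the combined vector after each pass of the loop. The short N and mu vectors and the helper name lens are made up for the example; they are not the asker's data.

```r
# Hypothetical small inputs (not the asker's data)
N  <- c(3, 4, 6)        # number of features at each location
mu <- c(206, 177, 686)  # locations

feature.location <- NULL  # start with an empty vector
lens <- integer(0)        # hypothetical helper: length after each iteration

for (i in seq_along(mu)) {
  a <- rnorm(N[i], mu[i], 20)                  # N[i] draws centred on mu[i]
  feature.location <- c(feature.location, a)   # append them to what we have so far
  lens[i] <- length(feature.location)
}

lens                      # running total of elements: the cumulative sum of N
length(feature.location)  # equals sum(N)
```

The loop body runs three times here, once per element of mu, and the vector grows by N[i] on each pass.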

I guess you're sampling from a series of locations, each a variable number of times.
I'm guessing your data looks something like this:
set.seed(1) # make reproducible
N <- ceiling(10*runif(10))
mu <- sample(seq(1000), 10)
> N;mu
[1] 3 4 6 10 3 9 10 7 7 1
[1] 206 177 686 383 767 496 714 985 377 771
Now you want to take a sample from rnorm of length N[i], with mean mu[i] and sd=20, for each i, and store all the results in one vector.
The method you're using (growing the vector inside the loop) is not recommended, because the vector is re-copied in memory each time it is extended. (See Circle 2 of The R Inferno, although for small examples like this it's not so important.)
First, initialize the storage vector:
f.l <- NULL
for (i in 1:length(mu)){
a <- rnorm(n=N[i], mean=mu[i], sd=20)
f.l <- c(f.l, a)
}
Then, each time, a stores your sample of length N[i] and c() combines it with the existing f.l by adding it to the end.
A more efficient approach is
unlist(mapply(rnorm, N, mu, MoreArgs=list(sd=20)))
This replaces the explicit loop with a single mapply call. unlist is needed because mapply returns a list of vectors of varying lengths.
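Another option, offered here as a sketch rather than as part of the original answer, relies on rnorm accepting a vector of means: repeating each mu[i] N[i] times with rep gives one mean per draw, so a single call produces the whole result. The N and mu below are the example data from above.

```r
set.seed(1)  # make reproducible
N  <- ceiling(10 * runif(10))
mu <- sample(seq(1000), 10)

# One mean per draw: mu[1] repeated N[1] times, then mu[2] repeated N[2] times, ...
means <- rep(mu, times = N)

# A single call draws all sum(N) values, each with its own mean
f.l <- rnorm(sum(N), mean = means, sd = 20)

length(f.l) == sum(N)  # TRUE
```

This avoids both the loop and the intermediate list, at the cost of building the repeated mean vector first.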

Related

How to save values in Vector using R

I am supposed to find the mean and standard deviation at each given sample size (N), using a for loop. I started writing the code below. I am required to save all the means into a vector "p". How do I save all the means into one vector?
sample.sizes =c(3,10,50,100,500,1000)
mean.sds = numeric(0)
for ( N in sample.sizes ){
x <- rnorm(3,mean=0,sd=1)
mean.sds[i]
}
mean(x)
Actually, you are doing several things wrong.
You use N as the loop variable, but you never use it inside the loop.
for (N in some_vector) means N takes each value of the vector in turn. So N in sample.sizes will first be 3, then 10, then 50, and so on.
So where does i come into the picture? It is never defined.
You compute x on every iteration, but always with a fixed size of 3 rather than N.
x holds three values. On the next line you apparently intend to store them in the ith element of mean.sds, but i is unknown, and three values cannot be stored in a single element of a numeric vector anyway.
Do you want this?
sample.sizes =c(3,10,50,100,500,1000)
mean.sds = numeric(0)
for ( i in seq_along(sample.sizes )){
x <- rnorm(sample.sizes[i], mean=0, sd=1)
mean.sds[i] <- mean(x)
}
mean.sds
[1] 0.6085489531 -0.1547286299 0.0052106559 -0.0452804986 -0.0374094936 0.0005667246
I replaced N with i in seq_along(sample.sizes), which iterates once per element of that vector: six times in this example.
I passed the ith element of sample.sizes as the first argument of rnorm to generate that many random values.
The random values are stored in the vector x; its mean (a single value) is computed and stored in the ith element of mean.sds, which started out empty.
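Since the question also asks for the standard deviation, the same loop can fill two result vectors. The following is a minimal sketch, assuming the means go into p as the question requires. The name s for the standard deviations is made up here, and both vectors are preallocated to their final length instead of grown.

```r
sample.sizes <- c(3, 10, 50, 100, 500, 1000)

# Preallocate one slot per sample size
p <- numeric(length(sample.sizes))  # means (name taken from the question)
s <- numeric(length(sample.sizes))  # standard deviations (hypothetical name)

for (i in seq_along(sample.sizes)) {
  x <- rnorm(sample.sizes[i], mean = 0, sd = 1)  # sample of the ith size
  p[i] <- mean(x)
  s[i] <- sd(x)
}

data.frame(N = sample.sizes, mean = p, sd = s)
```

The values differ from run to run unless set.seed is called first.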

Iterating a vector over a list in R

I am dealing with some computational feature extracting problem from RNA data, and I found myself unable to deal with this question:
I have n sequences (say two, for example) for which I computed an iterated statistic i times (a kind of Monte Carlo iteration for analyzing the distribution of the obtained statistics compared with the original).
Example:
Say we iterate 10 times
n <- 10
I got a vector of 20 values with all the iterations, but this vector corresponds to two different sequences, so I must divide this vector in two equal parts (the iterations are ordered 1:10 - 1:10 for each sequence).
MFEit <- c(10, 12, 34, 32, 12 .....) ## vector of length 20
MFEit.split <- split(MFEit, ceiling(seq_along(MFEit)/n))
This generates a list of two items each with 10 values, named $1 and $2
On the other hand I have a vector of two values which are the original statistics, each corresponding to each original sequence
MFE <- c(25, 15)
What I want to do is count how many values of the first item in the list MFEit.split are less than or equal to the first value of MFE, then how many values of the second item are less than or equal to the second value of MFE, and so on, however many MFE values or list items I have.
I know how to do it one by one, say:
R <- length(subset(MFEit.split$`1`, MFEit.split$`1`<=MFE[1]))
R <- length(subset(MFEit.split$`2`, MFEit.split$`2`<=MFE[2]))
But... how to include this into a loop so that I can get iteratively each comparison, no matter how many MFE values or items in the list I have?
The desired output would be a vector called R, with n values corresponding to each comparison.
Any help?...
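One possible way to do this, sketched under the assumption that MFEit.split and MFE have the same number of elements and are in the same order, is to walk over both in parallel with mapply. The MFEit values below are randomly generated placeholders, since the original vector is not shown in full.

```r
n <- 10
set.seed(1)
# Placeholder data: 20 made-up values standing in for the real MFEit
MFEit <- sample(5:40, 2 * n, replace = TRUE)
MFEit.split <- split(MFEit, ceiling(seq_along(MFEit) / n))
MFE <- c(25, 15)

# For each list item and its matching original value,
# count how many iterated values are <= the original
R <- mapply(function(values, original) sum(values <= original),
            MFEit.split, MFE)
R
```

sum works on the logical vector because TRUE counts as 1, and the result is a vector with one count per sequence, named after the list items.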

Generate many sample pairs from normal distribution

I am trying to learn how to use R for statistics, and I would like to know how I can generate 20 000 pairs (K pairs) of samples, each sample with 50 points from the same normal distribution (mean 2.5 and variance 9).
So far I know that this is how I make 50 points from a normal distribution:
rnorm(50,2.5,3)
But how do I generate 20 000 times a set of two samples so I can perform tests on the K pairs later?
x <- lapply(c(1:20000),
function(x){
lapply(c(1:2), function(y) rnorm(50,2.5,3))
})
This produces 20000 paired samples, where each sample is composed of 50 observations from a N(2.5,3^2) distribution. Note that x is a list where each slot is a list of two vectors of length 50.
To t-test the samples, you'll need to extract the vectors and pass them to the function t.test.
t.tests <- lapply(x, function(y) t.test(x=y[[2]], y=y[[1]]))
Something along the lines of
yourresults <- replicate(20000,{yourtest(matrix(rnorm(100,2.5,3),nc=2),<...>)})
or
yourresults <- replicate(20000,{yourtest(rnorm(50,2.5,3),rnorm(50,2.5,3),<...>)})
where yourtest is whatever your function is that's carrying out some test, and <...> is whatever other arguments you pass to yourtest. The first one is suitable if it expects a matrix with two columns, the second is suitable if it expects two vectors. You can adapt this approach to other forms of input - such as a formula interface - in the obvious way.
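As a concrete instance of the second form, here is a sketch that uses t.test as the test and keeps only the p-value of each pair. K is set to 200 (a made-up value, much smaller than the 20 000 in the question) so the example runs quickly.

```r
set.seed(123)  # make reproducible
K <- 200       # hypothetical small number of pairs; the question uses 20000

# Each replication draws two fresh samples of 50 points from N(2.5, 3^2)
# and returns the p-value of a two-sample t-test comparing them
pvals <- replicate(K, t.test(rnorm(50, 2.5, 3), rnorm(50, 2.5, 3))$p.value)

length(pvals)       # K
mean(pvals < 0.05)  # share of pairs "significant" at the 5% level
```

Because both samples come from the same distribution, the share of p-values below 0.05 should be roughly 5% for large K.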

how to select a matrix column based on column name

I have a table with shortest paths obtained with:
g<-barabasi.game(200)
geodesic.distr <- table(shortest.paths(g))
geodesic.distr
# 0 1 2 3 4 5 6 7
# 117 298 3002 2478 3342 3624 800 28
I then build a matrix with 100 rows and same number of columns as length(geodesic.distr):
geo<-matrix(0, nrow=100, ncol=length(unlist(labels(geodesic.distr))))
colnames(geo) <- unlist(labels(geodesic.distr))
Now I run 100 experiments where I create preferential attachment-based networks with
for(i in seq(1:100)){
bar <- barabasi.game(vcount(g))
geodesic.distr <- table(shortest.paths(bar))
distance <- unlist(labels(geodesic.distr))
for(ii in distance){
geo[i,ii]<-WHAT HERE?
}
}
and for each experiment, I'd like to store in the matrix how many paths I have found.
My question is: how to select the right column based on the column name? In my case, some names produced by the simulated network may not be present in the original one, so I need not only to find the right column by its name, but also the closest one (suppose my max value is 7, I may end up with a path of length 9 which is not present in the geo matrix, so I want to add it to the column named 7)
There is actually a problem with your approach. The length of the geodesic.distr table is stochastic, and you are allocating a matrix to store 100 realizations based on a single run. What if one of the 100 runs gives you a longer geodesic.distr vector? I assume you want to make the allocated matrix bigger in this case. Or, even better, you want to run the 100 realizations first, and allocate the matrix after you know its size.
Another potential problem is that if you do table(shortest.paths(bar)), then you are (by default) considering undirected distances, so you will end up with a symmetric matrix and count all distances (except for self-distances) twice. This may or may not be what you want.
Anyway, here is a simple way, with the matrix allocated after the 100 runs:
dists <- lapply(1:100, function(x) {
bar <- barabasi.game(vcount(g))
table(shortest.paths(bar))
})
maxlen <- max(sapply(dists, length))
geo <- t(sapply(dists, function(d) c(d, rep(0, maxlen-length(d)))))
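To address the column-name part of the question directly, a matrix with column names can be indexed by a character vector of those names. The sketch below shows this with two small hand-made tables standing in for the simulated distance tables (no igraph involved); the names runs and all.lengths are invented for the example.

```r
# Hand-made stand-ins for table(shortest.paths(bar)) from two runs
runs <- list(
  table(c(0, 1, 1, 2, 2, 2)),  # path lengths 0..2
  table(c(0, 1, 2, 3, 3))      # path lengths 0..3
)

# All path lengths seen in any run, sorted numerically
all.lengths <- sort(unique(as.integer(unlist(lapply(runs, names)))))

# One row per run, one named column per path length
geo <- matrix(0, nrow = length(runs), ncol = length(all.lengths),
              dimnames = list(NULL, as.character(all.lengths)))

for (i in seq_along(runs)) {
  # names() of a one-dimensional table are the observed path lengths
  geo[i, names(runs[[i]])] <- as.vector(runs[[i]])
}
geo
```

Lengths that never occur in a given run keep their initial 0, so the counts stay aligned by name even if a run skips a value.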

How to get a value of a multi-dimensional array by an INCOMPLETE vector of indices

This question is very similar to
R - how to get a value of a multi-dimensional array by a vector of indices
I have:
dim_count <- 5
dims <- rep(3, dim_count)
pi <- array(1:3^5, dims)
I want to get an entire line, but with an automatic building of the address of this line.
For example, I would like to get:
pi[1,,2,2,3]
## [1] 199 202 205
You could insert a sequence covering the whole dimension in the appropriate slot:
do.call("[",list(pi,1,1:dim(pi)[2],2,2,3))
By the way, defining a variable called pi is a little dangerous (I know this was inherited from the previous question) -- suppose you tried a few lines later to compute the circumference of a circle as pi*diameter ...
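To build the index list automatically from an incomplete vector, one possible convention (an assumption of this sketch, not something stated in the question) is to mark the free dimension with NA and expand it to the full range. The array is called arr here to avoid masking pi, and idx and args are made-up names.

```r
dim_count <- 5
dims <- rep(3, dim_count)
arr <- array(1:3^5, dims)  # same array as above, under a safer name

# Incomplete address: NA marks the dimension to take in full
idx <- c(1, NA, 2, 2, 3)

# Turn each entry into an index: the whole range for NA, the value otherwise
args <- lapply(seq_along(idx), function(k) {
  if (is.na(idx[k])) seq_len(dim(arr)[k]) else idx[k]
})

# Equivalent to arr[1, , 2, 2, 3]
line <- do.call(`[`, c(list(arr), args))
line
## [1] 199 202 205
```

The same code works with more than one NA, in which case the result is a sub-array rather than a single line.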
