Finding a minimum of a row in a matrix - r

I've been asked for an assignment to generate 1000 random samples of size 100 from a Uniform[0,1] distribution. For each sample, record the minimum. Then find the mean and variance of these minima.
I used this code so far so that I can have all my samples in a matrix with 1000 rows and 100 columns. (I think that's what it is doing).
x <- matrix(runif(100000,0,1),1000,100)
So now I need to find the minimum of each row, then from there I need the expected and variance and I really don't know where to start. Thanks in advance.

Use this,
apply(x,2,min)
You will get min values for each column

Related

How to get a single number result for a big matrix using var function?

Using the var function,
(a) find the sample variance of your row averages from above;
(b) find the sample variance for your XYZmat as a whole; <-this
(c) Divide the sample variance of the XYZmat by the sample variance of the row averages. The statistical theory says that ratio will on average be close to the row sample size, which is n, here.
(d) Do your results agree with theory? (That is a non-trivial question.) Show your work.
So this is what he asked for in the question, I could not get the single number result, so I just used the sd function and then squared the result. I keep wondering if there is still a way to get a single number result using var function. In my case n is 30, I got it from the previous part of the homework. This is the first R class I am taking and this is the first homework assigned, so the answer should be pretty simple.
I tried as.vector() function and I still got the set of numbers as a result. I played around with var function, no changes.
Unfortunately, I deleted all the code I had since the matrix is so big that my laptop started lagging.
I did not have any error messages, but I kept getting a set of numbers for the answer...
set.seed(123)
XYZmat <- matrix(runif(10000), nrow=100, ncol=100) # make a matrix
varmat <- var(as.vector(XYZmat)) # variance of whole matrix
n <- nrow(XYZmat) # number of rows
n
#> [1] 100
rowmeans <- rowMeans(XYZmat) # row means
varmat/var(rowmeans) # should be near n
#> [1] 100.6907
Created on 2019-07-17 by the reprex package (v0.3.0)

Increasing Bootstrap size in R

I have a time series on returns which is approximately 20 year long. Based on this time series, I want to compute somekind of a moving bootstrap to calculate the mean returns for every observation.
Let me demonstrate this on an example:
Let´s say we have information starting at 01.01.1990 and I want to compute the means with bootstrap starting at 02101.1991.
At 01.01.1991 I want to comupte the mean based on the returns between 01.01.1991-01.01.1990.
Then, on 02.10.1991 I also want to take into account the return of 02.01.1991 and therefore want to calculate the mean with bootstrap based on the returns from 01.01.1990-02.01.1991.
To sum up, the data for my bootstrap should increase by 1 through the time series.
I hope that you can understand what I am trying to say.
I would appreciate any help.
Cheers
Sven
So I managed to answer the question by myself
Let say we want to get the means calculated with bootstrap starting at 01.01.1991 which is the 300th observation in our sample
(Overall we have 1000 observations in our time series)
then the code is the following one:
h <- rep(1, 1000)
for (i in 300:1000) {
h[i] <- mean( sample(rawdata$retoil[1:i] , 5000 , replace=TRUE))
}
the first 300 row of h are 1's and can be deleted in the end
Hope I could help some of you :)

Most efficient way to randomize a matrix in R or in Python

I'm working with a numeric matrix M in R which is quite big (11000 rows per 20 columns). On this matrix, I'm performing a lot of correlation tests
=> the function cor.test(M[i,], M[j,], method='spearman') where i and j are two rows from the matrix (all possible combinations are tested).
The problem as you know is that I'm doing too many tests to get a very reliable p-value returned by this test.
My strategy to overcome this limitation would be to generate a new probability distribution by Bootstrap on my matrix M: I would like to get 100 random matrices generated from M to do the multiple correlations on these matrices and choose the right cut-off for the p-value to get a FDR of 5%.
My question is:
What is the most efficient way to randomize my matrix?
Since it's quite time consumming (I suppose) it could be interresting if the solution could be parallelized.
Thank you in advance for all the usefull answers that you'll provide to me.
In python there is a function random.sample() in module random. If you store M as list of rows, randomly sampling n rows from matrix M without replacement would be like this
M_sample = random.sample(M,n)
However, for bootstrapping, you might want to do random sampling with replacement. To do this, you can use numpy.random.choice():
import numpy
M_sample = numpy.random.choice(M,n,replace=True)
In R, we use sample() to randomly decide the row indices to take, and then use row access to take the rows from the matrices. Randomly sampling n rows from matrix M without replacement is done as follows:
indices = sample(nrow(M), n,replace=FALSE)
M_sample = M[indices, ]
And for randomly sampling with replacement, replace the first line with this:
indices = sample(nrow(M), n,replace=TRUE)

R: how to divide a vector of values into fixed number of groups, based on smallest distance?

I think I have a rather simple problem but I can't figure out the best approach. I have a vector with 30 different values. Now I need to divide the vector into 10 groups in such a way that the mean within group variance is as small as possible. the size of the groups is not important, it can anything between one and 21.
Example. Let's say I have vector of six values, that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector withe 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work
kmeans(Myvector,3, algo="Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page you will see that there are different algorithms to calculate the different clusters because it's not a trivial computational problem. They might necessarily guarantee optimality.

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix, which includes 100 rows and 10 columns, here I want to compare the diversity between rows and sort them. And then, I want to select the 10 maximum dissimilarity rows from it, Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial method is to calculate the similarity (e.g. saying tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ) between two rows, and dissimilairty = 1 - similarity, and then compare the dissimilarty values. At last I will sort all dissimilarity value, and select the 10 maximum dissimilarity values. But it seems that the result is a 100 * 100 matrix, maybe need efficient method to such calculation if there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking for some literatures. I find the one definition for the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that . The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jacard Index is not right for you. From the wikipedia page
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the colums and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greaest dissimiliarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.

Resources