dynamic correlation between vectors (with sampling) - r

Say, I have two vectors of the same length
A = mtcars$mpg
B = mtcars$cyl
I can calculate correlation between whole vectors
cor(A, B)
and get one single value (-0.852162).
What I need is to calculate the correlation between the two vectors with a sampling rate of 10. That is, I start at the first datapoint in A and B, take 5 values to the right of it (there are no values on the left), calculate a correlation coefficient and write it into a vector C. Then I take the next value in A and B, take 5 values on the right and 1 on the left, and write the result into C; then I shift again to the next value, and so forth. The resulting vector C must contain the same number of values as A or B (N=32), and each value in C represents the correlation between A and B with a sampling rate of 10 (5 values to the left and 5 to the right of that datapoint, if available).
Is there any elegant and simple way to do it in R?
P.S.: Ease of coding is more important than the time needed for calculations.

The TTR package may provide what you are looking for.
It should be as simple as:
TTR::runCor(A, B)
There is a whole blog post about rolling correlation here.
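One caveat, as far as I know: runCor works on a trailing window (n = 10 by default) and leaves the first n - 1 entries as NA, so it does not shrink the window at the edges the way the question describes. A minimal base-R sketch of the centered, edge-truncated version follows; centered_cor and its half argument are names made up here for illustration.

```r
A <- mtcars$mpg
B <- mtcars$cyl

# Hypothetical helper: correlation over a window of up to `half` values
# on each side of position i, truncated at the ends of the vectors.
centered_cor <- function(x, y, half = 5) {
  n <- length(x)
  sapply(seq_len(n), function(i) {
    idx <- max(1, i - half):min(n, i + half)
    cor(x[idx], y[idx])
  })
}

C <- centered_cor(A, B)
length(C)  # same length as A and B
```

A window in which either vector is constant would give NA (with a warning), since the correlation is undefined there.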


Why does row echelon form work for finding the Rank of a matrix?

I have been studying linear algebra and came across the Gaussian elimination method. While it does give the correct answers, I am not able to understand how it works.
So, we compare each row with the rest of the rows to see if any is dependent on the current row. But how does it identify whether one of the rows in the matrix is a linear combination of 2 or more other rows?
This builds upon the result that elementary row operations don't change the rank of a matrix. Hence, if we reduce the original matrix to its row echelon form, the two matrices share the same rank.
To see that the non-zero rows are linearly independent, suppose r1, r2,...,rk are the non-zero rows (sorted in that order), and consider c1r1+...+ckrk = 0. Look at the position of the leading non-zero entry of r1: the other rows all have value 0 in that position, hence we conclude that c1 = 0. After that, we can repeat the same argument for c2 and so on, and conclude that all of them must be zero. Hence the non-zero rows are linearly independent, and the rank is equal to the number of non-zero rows.
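To make the argument concrete, here is a small sketch that reduces a matrix to row echelon form and counts the non-zero (pivot) rows. The function name row_echelon_rank, the tolerance tol, and the choice of partial pivoting are assumptions added for illustration; they are not part of the argument above.

```r
# Hypothetical helper: rank = number of pivot rows after row reduction.
row_echelon_rank <- function(M, tol = 1e-10) {
  M <- as.matrix(M)
  r <- 0  # number of pivot rows found so far
  for (j in seq_len(ncol(M))) {
    if (r == nrow(M)) break
    rows <- (r + 1):nrow(M)
    p <- rows[which.max(abs(M[rows, j]))]  # candidate pivot row
    if (abs(M[p, j]) < tol) next           # no pivot in this column
    r <- r + 1
    M[c(r, p), ] <- M[c(p, r), ]           # row swap (rank-preserving)
    if (r < nrow(M)) {
      for (i in (r + 1):nrow(M)) {
        # subtract a multiple of the pivot row (rank-preserving)
        M[i, ] <- M[i, ] - M[i, j] / M[r, j] * M[r, ]
      }
    }
  }
  r
}

M <- rbind(c(1, 2, 3),
           c(4, 5, 6),
           c(5, 7, 9))  # third row = first row + second row
row_echelon_rank(M)
qr(M)$rank             # base R's rank estimate, for comparison
```

The third row is a combination of the other two, and elimination turns it into a zero row, which is how a dependency on several rows at once gets detected.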

Generate sets from given overlap matrix

Note: I edited the original question to explain more precisely.
While I was doing a simulation for my new method, I needed to generate a special type of dataset consisting of multiple subsets. The problem is that there are some "shared" variables across the subsets, and the number of shared variables is called the "overlap" here. Since the distribution of the overlap proportion is given, I need to generate an appropriate list of variables whose overlap follows the given distribution. But I have failed to implement such an algorithm...
I am not sure whether there is a specific algorithm for this kind of question,
but I have failed to find one after a long search.
I prefer an R solution, but anything else would also be much appreciated. Please help me to solve this problem! Thank you so much in advance!
Below is a standardized explanation of my problem. I tried to explain it as generally as I can, but please let me know if it is not sufficient.
Purpose: Generate n sets from given overlap matrix of elements. Each set contains k elements.
Input: An n*n matrix whose (i,j)th cell value represents the number of elements shared between the (i)th set and the (j)th set.
Output: A list of k element identifiers (anything can be used, such as numbers) for each of the n sets.
Assumption: The number of elements in each set is k, and it is the same across all n sets. The input matrix is symmetric.
Example (assumes k=3 and n=3)
Input
3 1 0
1 3 1
0 1 3
Output
Set 1: A B C
Set 2: A D E
Set 3: D F G
In the above example input, the (1,2)th and (2,1)th cells are 1 because sets 1 and 2 share element "A", and the diagonal cells are 3 (=k) because each set shares all elements with itself.
I would repeat the following process until I had accounted for all the matrix entries:
1) Treat the matrix as the adjacency matrix of a graph, and find the largest clique in it. That is, find the largest possible set S of indexes such that M(i,j) > 0 for all i, j in S.
2) Create an item that is in all of the sets which correspond to the indexes in S. In fact, if the minimum value of M(i,j) over S is v, create v such items.
3) Subtract v from M(i,j) for all i, j in S, accounting for the counts generated by the items you have just created.
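The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: generate_sets is a made-up name, the clique search is replaced by brute-force enumeration of index subsets from largest to smallest (only practical for small n), and the diagonal is treated like any other entry so that leftover diagonal counts become elements unique to one set. The greedy process is not guaranteed to reproduce every valid input matrix.

```r
# Hypothetical helper: greedy construction of sets from an overlap matrix M.
generate_sets <- function(M) {
  n <- nrow(M)
  sets <- vector("list", n)
  next_id <- 1
  # All non-empty index subsets, largest first (brute force, small n only)
  subsets <- unlist(lapply(n:1, function(s) combn(n, s, simplify = FALSE)),
                    recursive = FALSE)
  for (S in subsets) {
    v <- min(M[S, S])  # smallest remaining count among the indexes in S
    if (v > 0) {
      ids <- seq(next_id, length.out = v)  # v new items shared by all of S
      next_id <- next_id + v
      for (i in S) sets[[i]] <- c(sets[[i]], ids)
      M[S, S] <- M[S, S] - v               # account for the new items
    }
  }
  sets
}

M <- matrix(c(3, 1, 0,
              1, 3, 1,
              0, 1, 3), nrow = 3, byrow = TRUE)
sets <- generate_sets(M)
sets
```

It is worth recomputing the pairwise intersections of the result and comparing them with the input matrix, since that check will reveal cases where the greedy choice went wrong.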

Generating two sets of numbers where the sum of each set and the sum of their dot product is N

In the question "Getting N random numbers that the sum is M", the object was to generate a set of random numbers that sums to a specific number. After reading that question, I started playing around with the idea of generating sets of numbers that satisfy this condition:
sum(A) == sum(B) && sum(B) == sum(A * B)
An example of this would be
A <- c(5, 5, -10, 6, 6, -12)
B <- c(5, -5, 0, 6, -6, 0)
In this case, the three sums equal zero. Obviously, those sets aren't random, but they satisfy the condition. Is there a way to generate 'random' sets of data that satisfy the above condition? (As opposed to using a little algorithm as in the above example.)
(Note: I tagged this as an R question, but the language really doesn't matter to me.)
You'd need to define the first vector in N-dimensional space, and the 2nd one will have N-2 degrees of freedom (i.e. random numbers) since the sum and one angle are already determined.
The 2nd vector would need to be transformed into N-dimensional space; there are infinitely many transforms that could work, so if you don't care about the probability distribution of the resulting vectors, just choose the one that's most intuitive to you.
There's a nice geometrical interpretation of the first constraint: it constrains the 2nd vector to a (hyper-)plane in N-dimensional space; the 2nd constraint doesn't have a simple geometric interpretation.
Check out hyperspherical coordinates.
You can generate one set completely randomly. And generate randomly all numbers in set B except for two numbers. Since you have two equations you should be able to solve for those two numbers.
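A minimal sketch of that last suggestion, assuming normally distributed values and that the last two entries of B are the ones solved for (both choices are arbitrary, and the variable names are illustrative). The two conditions sum(B) = sum(A) and sum(A * B) = sum(A) are linear in the two unknown entries, so they form a 2 x 2 linear system.

```r
set.seed(42)
n <- 6
A <- rnorm(n)                      # first set: fully random
B <- c(rnorm(n - 2), NA, NA)       # second set: last two entries unknown

free <- 1:(n - 2)                  # positions already filled at random
fix  <- (n - 1):n                  # positions to solve for
s <- sum(A)

# Row 1: B[fix][1] + B[fix][2]                 = s - sum(B[free])
# Row 2: A[fix][1]*B[fix][1] + A[fix][2]*B[fix][2] = s - sum(A[free] * B[free])
lhs <- rbind(c(1, 1), A[fix])
rhs <- c(s - sum(B[free]), s - sum(A[free] * B[free]))
B[fix] <- solve(lhs, rhs)

all.equal(sum(A), sum(B))
all.equal(sum(B), sum(A * B))
```

The system is singular when the two corresponding entries of A are equal, and the sums agree only up to floating-point error, hence all.equal rather than ==.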

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows and sort them, and then select the 10 most dissimilar rows from it. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity (e.g. the Tanimoto coefficient or another measure: http://en.wikipedia.org/wiki/Jaccard_index ) between each pair of rows, take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all the dissimilarity values and select the 10 largest. But the result is a 100 * 100 matrix, so a more efficient method may be needed when there is a large number of rows. However, this is just my thought and may not be right, so I need help.
[update]
After looking through some literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ^ 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
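If the maximum dissimilarity method quoted in the question's update is what is wanted instead, a rough sketch could look like this. It assumes Euclidean distance and reads "maximally distant from the current points" as maximizing the distance to the nearest already-chosen row; max_dissim and its start argument are hypothetical names. It builds the full pairwise distance matrix, which is fine for 100 rows but will not scale to very large inputs.

```r
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)

# Hypothetical helper: greedily pick k mutually distant rows of m.
max_dissim <- function(m, k, start = 1) {
  d <- as.matrix(dist(m))          # full pairwise Euclidean distances
  chosen <- start
  while (length(chosen) < k) {
    # for every row, the distance to its nearest already-chosen row
    nearest <- apply(d[, chosen, drop = FALSE], 1, min)
    nearest[chosen] <- -Inf        # never re-pick a chosen row
    chosen <- c(chosen, unname(which.max(nearest)))
  }
  chosen
}

picked <- max_dissim(mat, 10)      # row indices of the 10 selected rows
most.dissimilar <- mat[picked, ]
```

To follow the definition literally, pass a random first row, e.g. start = sample(nrow(mat), 1).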

Make a matrix full-ranked?

How can I turn a regular matrix into a full-rank matrix in R? Is there an available method for that?
I have a matrix that may have linearly dependent columns and I need to
pass it to a function that requires its argument to be a matrix with
full rank. Since linearly dependent columns are not of interest
anyway, I am looking for a function that removes such columns until
the matrix is full rank. There may be several solutions of course, but
any one of them should be fine.
Right now I am just constructing the matrix column by column and only
adding a column if the resulting matrix is still full rank, but it
feels like there should be a better way to do this.
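One way to do this in base R, sketched under the assumption that the column pivoting done by qr() is an acceptable way to choose which columns to keep: qr() reports an estimated rank and a pivot order in which columns it judged dependent are moved to the end. The names X, keep and X_full are illustrative.

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
X <- cbind(X, X[, 1] + X[, 2])     # fifth column is linearly dependent

q <- qr(X)
# the first q$rank pivot columns form a linearly independent subset
keep <- sort(q$pivot[seq_len(q$rank)])
X_full <- X[, keep, drop = FALSE]

qr(X_full)$rank == ncol(X_full)    # check that the result has full column rank
```

Which of several mutually dependent columns gets dropped depends on the pivoting, and the rank decision depends on qr()'s tol argument, so near-dependent columns may need a different tolerance.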
Another approach is to minimize |y - Ax|^2 + c |x|^2,
by tacking sqrt(c) times an identity matrix on to A and zeros on to y.
The parameter c (a.k.a. λ)
trades off fitting y - Ax against keeping |x| small.
Then run a second fit with the r largest components of x,
r = rank(A) (or any number you please).
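A small sketch of the augmentation trick, with made-up data and an arbitrary value for the parameter (called lambda in the code): stacking sqrt(lambda) * I under A and zeros under y turns the penalized problem into an ordinary least-squares problem, which can be checked against the closed-form solution of the normal equations.

```r
set.seed(1)
A <- matrix(rnorm(30), nrow = 10, ncol = 3)
A <- cbind(A, A[, 1] + A[, 2])     # rank-deficient: column 4 is dependent
y <- rnorm(10)
lambda <- 0.1                      # the trade-off parameter c from the text

p <- ncol(A)
A_aug <- rbind(A, sqrt(lambda) * diag(p))  # augmented design matrix
y_aug <- c(y, rep(0, p))                   # augmented response
x_aug <- qr.solve(A_aug, y_aug)            # least squares on the augmented system

# closed form: (A'A + lambda I)^{-1} A'y
x_ridge <- solve(crossprod(A) + lambda * diag(p), crossprod(A, y))
all.equal(as.vector(x_ridge), x_aug)
```

The augmented matrix has full column rank for any lambda > 0 even though A itself does not, which is why the fit goes through without dropping columns first.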
