Forming a query vector in LSA - information-retrieval

After performing the SVD of a term-document matrix and obtaining a reduced-rank matrix, various sources state the following reduced query vector formula. It seems easy to see how it is derived.
However, in this link the query vector is calculated as the centroid of the corresponding reduced term vectors. I tried to check whether the two are the same, but the results were different.
What is the difference between the two and what are the pros/cons of using either?
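For concreteness, here is a minimal R sketch of the two approaches. It assumes the formula referred to is the usual fold-in q_k = q^T U_k S_k^{-1}, that the "centroid" approach averages the corresponding rows of U_k (some write-ups use the rows of U_k S_k instead), and that the toy matrix, rank k and query are made up purely for illustration.

set.seed(1)
A <- matrix(rpois(8 * 5, 2), nrow = 8)            # toy 8-term x 5-document matrix
k <- 2
s <- svd(A)
U_k <- s$u[, 1:k]                                 # reduced term vectors (one row per term)
S_k <- diag(s$d[1:k])                             # top-k singular values

q <- c(1, 0, 1, 0, 0, 0, 1, 0)                    # binary query over the 8 terms

q_foldin   <- drop(t(q) %*% U_k %*% solve(S_k))   # fold-in formula
q_centroid <- colMeans(U_k[q == 1, ])             # centroid of the query's reduced term vectors

rbind(q_foldin, q_centroid)

Up to a constant factor, the two differ exactly by the per-dimension 1/sigma_i weighting applied by S_k^{-1}, which is why similarity scores against the document vectors generally come out different.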

Related

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it with a wordcloud, I want to see which words are correlated with each other. Here comes the problem:
quanteda has the function textstat_simil, but it says "similarity". So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. In this case, is the phi correlation (from the chi-squared statistic) more appropriate? Can I calculate this via quanteda?
Do you have any material other than the GitHub source code that explains in more detail the methods for calculating similarity or distance measures? (I couldn't understand it from this code, sorry.)
Thanks for your patience!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman’s.
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.
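For example, a minimal sketch with a made-up toy dfm (note that in quanteda >= 3 the textstat_*() functions live in the companion package quanteda.textstats):

library(quanteda)
library(quanteda.textstats)                        # textstat_simil() lives here in quanteda >= 3

toks  <- tokens(c(d1 = "a b b c", d2 = "a a b d", d3 = "c c d e"))
dfmat <- dfm(toks)

# Pearson correlation among features (the default method)
textstat_simil(dfmat, margin = "features", method = "correlation")

# for a binary dfm, coerce and use stats::cor(), e.g. Spearman
m <- as.matrix(dfm_weight(dfmat, scheme = "boolean"))
cor(m, method = "spearman")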

Different bandwidth specification in mean-shift clustering with different packages in R

I want to perform mean-shift clustering in R and found that there are at least two packages with this functionality: MeanShift and meanShiftR. As shown here, the latter is much faster, and since the first one took a long time to perform a clustering when I tried it, I'm keen on choosing meanShiftR. However, the meanShiftR::meanShift function has a rather uncommon way of specifying the bandwidth; see this part of the documentation:
queryData: A matrix or vector of points to be classified by the mean shift algorithm. Values must be finite and non-missing.
bandwidth: A vector of length equal to the number of columns in the queryData matrix, or length one when queryData is a vector. This value will be used in the kernel density estimate for steepest ascent classification. The default is one for each dimension.
I'm not an expert in mean-shift clustering, but the only bandwidth specifications I have found in the literature are a scalar or a positive-definite, symmetric matrix, not a vector. So is this just a technical way of representing the bandwidth, where the value has to be the same for each dimension? Or can it vary?
The other issue is that even when setting the same bandwidth value in meanShiftR as in MeanShift::msClustering, just replicated to match the number of columns, I obtained totally different results, in particular a much larger number of clusters. Also, the modes were all very similar to one another and not representative of the dataset. That made me wonder whether this package works correctly. Has anyone actually used meanShiftR? If so, could you show an example, as the documentation is not clear enough for me?
This isn't actually a different kind of bandwidth: the vector holds one scalar per dimension (one entry for each column of queryData), which amounts to a diagonal bandwidth matrix. To reproduce a single scalar bandwidth h, simply repeat h for every column.
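As a minimal sketch (argument and output names follow the meanShiftR documentation; trainData, assignment and value are not in the excerpt quoted above, so double-check them against your installed version, and treat the dataset and bandwidth value as purely illustrative):

library(meanShiftR)

x <- as.matrix(iris[, 1:4])
h <- 0.8                                           # the scalar bandwidth you would use elsewhere
fit <- meanShift(queryData = x, trainData = x,
                 bandwidth = rep(h, ncol(x)))      # same scalar repeated once per dimension

table(fit$assignment)                              # cluster labels per point
head(fit$value)                                    # the mode each point converged to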

Laplace expansion for determinants with r

I have a 21x21 matrix. I would like to use R to apply the Laplace expansion along the first row, so as to display only the last step of the determinant calculation (the 2x2 matrices).
Unfortunately, despite my best efforts, I can't figure out how this could be done.
To be clearer, I provide an example with a 3x3 matrix,
e.g. r <- matrix(c(1:9), 3, 3)
My aim is to expand along the first row so as to obtain the three cofactors. These cofactors should be displayed so that one can distinguish the three minor matrices, the corresponding multiplying elements of the matrix, and the signs of the permutation.
For a visual, you can take a look at the first example at http://en.wikipedia.org/wiki/Laplace_expansion.
Any suggestions? Thank you
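Not a complete solution for the 21x21 case, but here is a minimal sketch of the first-row expansion; laplace_first_row() is a hypothetical helper written for illustration, which prints the sign, the multiplying element and the minor of each term before summing the cofactor products:

laplace_first_row <- function(A) {
  n <- nrow(A)
  total <- 0
  for (j in seq_len(n)) {
    minor <- A[-1, -j, drop = FALSE]               # delete row 1 and column j
    sgn   <- if (j %% 2 == 1) 1L else -1L          # sign (-1)^(1 + j)
    cat("term", j, ": sign", sgn, ", element", A[1, j], ", minor:\n")
    print(minor)
    total <- total + sgn * A[1, j] * det(minor)    # element times cofactor
  }
  total                                            # equals det(A)
}

r <- matrix(c(1:9), 3, 3)
laplace_first_row(r)                               # 0, the determinant of r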

How do I generate data from a similarity matrix?

Suppose there are 14 objects, each of which either has or does not have each of 1000 binary features. I have a 14x14 similarity matrix, but not the raw 14x1000 data. Is there a way to reconstruct or generate something similar to the raw data, given the similarity matrix?
I tried Monte Carlo simulations, but unconstrained they would take way too much time to achieve even a low level of consistency with the original similarity matrix.
I saw this relevant question: Similarity matrix -> feature vectors algorithm?. However, they wanted to reduce, not increase, dimensionality. Also, I am not sure (1) which matrix or matrices to use, and (2) how to convert the result into a binary matrix.
It's impossible to say for sure unless you describe how the similarity scores were computed.
In general, for the usual kind of similarity scoring this is not possible: information has been lost in the transformation from individual features to aggregate statistics. The best you can hope to do is to arrive at a set of features that are consistent with the similarity scores.
I think that is what you are getting at when you say "similar to" the original. That problem is pretty interesting. Suppose similarity was computed as the dot product of two feature vectors (i.e. the count of features that both objects in a pair have with value = 1/true). This is not the only choice, but it is consistent with a value of 0 (false) meaning no information, and it may generalize to other similarity measures.
In such a case, the problem is really a constraint-satisfaction problem: a naive approach is to exhaustively search the space of possible objects, not randomly but guided by the constraints. For example, let SIM(A, B) denote the similarity of objects A and B, and define an order on the feature vectors.
If SIM(A, B) = N, choose A = B minimal (i.e. 1 repeated N times followed by 0 repeated 1000 - N times), then choose the minimal C such that SIM(A, C) and SIM(B, C) take the given values, and so on. Once you find an inconsistency, backtrack and increment.
This will find a consistent answer, although the complexity is very high (but probably still better than Monte Carlo).
Finding a better algorithm is an interesting problem, but more than this I can't say in a SO post - that's probably a topic for a CS thesis!
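To make the search concrete, here is a minimal sketch of the consistency check that such a guided search would call on every candidate, assuming similarity is the dot product of binary feature vectors; is_consistent() is a hypothetical helper and the toy data are made up:

is_consistent <- function(X, S) {
  # X: candidate objects-by-features binary matrix; S: given similarity matrix
  D <- tcrossprod(X)                               # D[i, j] = number of shared 1s
  all(D[upper.tri(D)] == S[upper.tri(S)])
}

# toy example: 3 objects, 6 binary features
X <- rbind(c(1, 1, 1, 0, 0, 0),
           c(1, 1, 0, 1, 0, 0),
           c(0, 1, 0, 0, 1, 1))
S <- tcrossprod(X)                                 # the "observed" similarities
is_consistent(X, S)                                # TRUE by construction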

Finding full QR decomposition from reduced QR

What's the best way to find additional orthonormal columns of Q? I have computed the reduced QR decomposition already, but need the full QR decomposition.
I assume there is a standard approach to this, but I've been having trouble finding it.
You might wonder why I need the full Q matrix. I'm using it to apply a constraint matrix for "natural" splines to a truncated power series basis expansion. I'm doing this in Java, but am looking for a language-independent answer.
Successively add columns to Q in the following way:
Pick a vector not already in the span of the columns of Q.
Orthogonalize it with respect to the columns of Q and normalize it to unit length.
Add the resulting vector as a new column of Q.
Add a row of zeros to the bottom of R.
For reference, see these illustrative albeit mathematical lecture notes
Just in case: the process of "orthogonalizing" a new vector against existing ones is the classical Gram-Schmidt process, and there is a variant (modified Gram-Schmidt) which is numerically stable.
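A minimal R sketch of that recipe, with complete_qr() a hypothetical helper (in R itself, qr.Q(qr(A), complete = TRUE) already returns a full Q):

complete_qr <- function(Q1, R1, tol = 1e-10) {
  m <- nrow(Q1)
  n <- ncol(Q1)
  Q <- Q1
  for (j in seq_len(m)) {
    if (ncol(Q) == m) break                        # Q is already m x m
    v <- numeric(m)
    v[j] <- 1                                      # candidate: j-th standard basis vector
    v <- v - Q %*% crossprod(Q, v)                 # orthogonalize against current columns
    v <- v - Q %*% crossprod(Q, v)                 # repeat once for numerical stability
    if (sqrt(sum(v^2)) > tol)                      # keep it only if it was not in span(Q)
      Q <- cbind(Q, v / sqrt(sum(v^2)))
  }
  R <- rbind(R1, matrix(0, m - n, n))              # pad R with rows of zeros
  list(Q = Q, R = R)
}

set.seed(42)
A   <- matrix(rnorm(5 * 3), 5, 3)
qrA <- qr(A)
out <- complete_qr(qr.Q(qrA), qr.R(qrA))
all.equal(out$Q %*% out$R, A)                      # the full factorization still reproduces A
round(crossprod(out$Q), 10)                        # ~ identity: the columns are orthonormal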
