Questions about SVD, Singular Value Decomposition - math

I am not a mathematician, so I need to understand what SVD does and WHY more than how it works exactly from the math perspective. (I understand at least what is the decomposition though).
This guy on YouTube gave the only human explanation of SVD, saying that the U matrix maps "user to concept correlation", the Sigma matrix defines the strength of each concept, and V maps "movie to concept correlation", given that the initial matrix M has users in the rows and movies (ratings) in the columns.
He also mentioned two concept specifically "sci fi" and "romance" movies. See the picture below.
My questions are:
1. How does SVD know the number of concepts? He, as a human, mentioned two (sci fi and romance), but in reality the resulting matrices have 3 concepts (for example matrix U, the one with blue titles, has 3 columns, not 2).
2. How does SVD know what a concept is after all? I mean, what if I shuffle the columns randomly: how does SVD then know what is sci fi and what is romance? I suppose there is no rule that the concepts must be grouped together in column order. What if the sci fi movies are the first and last ones, and not the first 3 columns in the initial matrix M?
3. What is the practical usage of the U, Sigma or V matrices? (Except that you can multiply them to get the initial matrix M.)
4. Is there any other possible human explanation of SVD than the one the guy above provided (matrices of correlations), or is it the only possible one?

As was pointed out in the comments you may well get better explanations elsewhere. However since the question is still open, here is my tuppence worth.
Throughout I'll suppose that A is mxn where m>=n, ie that A has at least as many rows as columns.
First of all there are many forms of the SVD, differing in the sizes of the matrices. They all share the fundamental properties that
A = U*S*V'
S is diagonal
U and V have orthonormal columns (ie U'*U = I, V'*V = I)
Perhaps the most useful from a theoretical point of view is the 'full fat' svd where we have that U is mxm, S is mxn and V is nxn. However this has rather a lot of elements that don't really contribute to A. For example S being diagonal we can write
S = ( S1 )    (where S1 is nxn)
    ( 0  )
If we divide up U into
U = ( U1 U2) (where U1 is mxn and U2 is (mx(m-n)))
Then it's straightforward to calculate that
U*S = U1*S1
and so we can throw away the last m-n columns of U and the last m-n rows of S, and still recover A.
Moreover some of the diagonal elements of S1 may be 0; suppose in fact that p<n of them are non zero. Then we can write
S1 = ( S2 0 )
     ( 0  0 )
And arguing as above for U and analogously for V' we can in fact throw away all but the first p columns of U and all of S but S2, and all but the first p rows of V', and still recover A.
This latter is the form of SVD ('thin') in your question:
U is mxp
S is pxp
V' is pxn
where p is the number of non-zero singular values of A. This is my answer to your 1.
By convention the elements of S decrease as you move down the diagonal. To achieve this the routine that calculates the svd in effect works with a version of A with shuffled columns. This shuffling is undone by incorporating the shuffle in the U and V' output. This is my answer to your 2: however you shuffle A, it will be in effect shuffled again to ensure that the singular values decrease down the diagonal. Shuffling the columns of A does not change the singular values; it only permutes the rows of V (the columns of V') in the same way, so the order of the movies does not matter.
I struggle to answer 3, because I suspect that our ideas of 'practical' are rather different.
One thing that I think practical is to find simpler approximations to A. The reconstruction of A can be written
A = Sum{ 1<=i<=p | U[i]*S[i]*V[i]' }
where the S[i] are the diagonal elements of S, U[i] are the columns of U and V[i] those of V
We might want to use a simpler model for A, for example want to simplify it down to just one term. That is, we might wonder how much we would lose by using fewer 'concepts'. The 'thin' svd above has already done this in the sense that it has thrown away all the columns that make no contribution to A. In an extreme case, we might wonder what we would get if we reduced to just one concept. This approximation is found by taking just the first term of the sum above. This extends to however many terms -- q say -- we want to allow: we just take the first q terms of the sum above.
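To make the one-concept approximation concrete, here is a minimal sketch in plain Python. It does not call an SVD routine; it swaps in power iteration on A'A, which only recovers the largest singular value and its two vectors. The ratings matrix and all function names are made up for illustration.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_triple(A, iters=500):
    """Power iteration on A'A. Returns (s, u, v) with A v = s u and |v| = 1."""
    At = transpose(A)
    v = [1.0] * len(A[0])  # starting guess; fine for a matrix with nonnegative entries
    for _ in range(iters):
        w = matvec(At, matvec(A, v))            # w = A'A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    s = math.sqrt(sum(x * x for x in Av))
    u = [x / s for x in Av]
    return s, u, v

def rank_one(s, u, v):
    """The single term U[1]*S[1]*V[1]' of the sum."""
    return [[s * ui * vj for vj in v] for ui in u]

def frob_sq(M):
    """Squared Frobenius norm: the sum of squares of all entries."""
    return sum(x * x for row in M for x in row)

# Hypothetical ratings: rows are users, columns are movies.
A = [[5, 5, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 4, 5],
     [0, 1, 5, 4]]

s, u, v = top_singular_triple(A)
A1 = rank_one(s, u, v)
diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, A1)]
err = frob_sq(diff)  # what is lost by keeping just one concept
```

The squared error of the one-term approximation equals the sum of squares of the entries of A minus s^2, so the size of the first singular value tells you directly how much of A that single concept accounts for.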
I'm sorry, I can't answer 4.

covariance formula: multiplying just the weights "in couple" in R

ok basically if you look at the covariance formula when weights are involved (look at this picture so everything is clear http://postimg.org/image/sjr2tnk85/), I just want to calculate the sum of all the different couples of weights as highlighted in the link of the picture I uploaded.
I absolutely need that specific quantity highlighted in the picture. I have no use of the formulas cor() [i tried but it was useless]
I have tried to use "for" loops trying to following the mathematical formula but came out empty handed.
I am sorry if this post lacks the specificity required for this forum but it was the best way I could think of in order to explain my problem.
sum(outer(w,w), -crossprod(w)) / 2
Z <- outer(a,b) creates a matrix where Z[i,j] = a[i]*b[j]. Plugging in w for both a and b, this is a symmetric matrix.
crossprod(x) calculates the sum of squares of x. This is the sum of the diagonal entries of the above matrix.
Take the difference, then divide by two because you only want the upper triangle of the matrix (each pair once).
Alternatively, you could try sum( apply(combn(w,2), 2, prod) ) to explicitly form each pair, multiply them, and sum them up.
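For readers who do not use R, the same two approaches can be sketched in Python. The weights below are made-up values and the function names are only illustrative.

```python
from itertools import combinations

def pair_sum_closed_form(w):
    """(sum(w)^2 - sum(w^2)) / 2: all products minus the diagonal, halved."""
    total = sum(w)
    return (total * total - sum(x * x for x in w)) / 2

def pair_sum_explicit(w):
    """Form every unordered pair once, multiply, and add up."""
    return sum(a * b for a, b in combinations(w, 2))

w = [0.2, 0.3, 0.5]  # hypothetical weights
closed = pair_sum_closed_form(w)
explicit = pair_sum_explicit(w)
```

Both give 0.2*0.3 + 0.2*0.5 + 0.3*0.5 = 0.31 up to floating point rounding; the closed form avoids building all the pairs.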

Matrix and multiplication complexity

I'm having trouble understanding the time complexity of the solution to a problem.
Let X, Y and Z be n × n matrices. Suppose we want to verify whether XY = Z. What is the complexity of solving the problem directly by computing XY?
The correct answer is O(n^3), but I don't understand why. Why is this the case?
The standard algorithm for computing the product of two n × n matrices is to use the fact that the entry at position (i, j) in the product is the inner product of the ith row of the first matrix and the jth column of the second matrix. Computing the inner product of this row and this column takes time Θ(n), because there are n entries that need to be pairwise multiplied and summed together. Therefore, each entry of the resulting matrix takes time Θ(n). Since there are n^2 total entries in the product, the total time complexity using the naive algorithm is Θ(n^3).
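The counting argument can be seen directly in a minimal Python sketch of the naive algorithm. The multiplication counter is only there to make the n^3 visible; the function name is illustrative.

```python
def naive_matmul(X, Y):
    """Multiply two n x n matrices (lists of rows) with the textbook triple loop."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    multiplications = 0
    for i in range(n):          # n rows of the result
        for j in range(n):      # n columns of the result
            for k in range(n):  # inner product of row i of X and column j of Y
                Z[i][j] += X[i][k] * Y[k][j]
                multiplications += 1
    return Z, multiplications

X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
Z, count = naive_matmul(X, Y)
```

For n = 2 the counter ends at 8 = 2^3; each of the three nested loops runs n times.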
There are faster matrix multiplication approaches than the one described here that use more sophisticated algorithms. You might want to look up the Coppersmith-Winograd algorithm or the Strassen algorithm, which are asymptotically faster than the naive algorithm.
If you only need to verify the product rather than compute it, there are better randomized algorithms. Check out Freivalds' algorithm, an O(n^2)-time randomized algorithm that with high probability detects whether the multiplication is correct.
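A minimal sketch of that randomized check in Python, assuming integer matrices stored as lists of rows (with floats you would compare within a tolerance instead of using !=). The function name and the `rounds` parameter are illustrative.

```python
import random

def freivalds(X, Y, Z, rounds=20, rng=random):
    """Return False if XY != Z was detected, True if every round passed.

    Each round costs three matrix-vector products, i.e. O(n^2) work.
    A wrong Z survives a single round with probability at most 1/2.
    """
    n = len(X)
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n)]  # random 0/1 vector
        Yr = [sum(Y[i][k] * r[k] for k in range(n)) for i in range(n)]
        XYr = [sum(X[i][k] * Yr[k] for k in range(n)) for i in range(n)]
        Zr = [sum(Z[i][k] * r[k] for k in range(n)) for i in range(n)]
        if XYr != Zr:
            return False  # definitely XY != Z
    return True           # XY == Z, or a wrong Z slipped through every round

X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]
bad = [[18, 22], [43, 50]]
```

A correct Z always passes. A wrong Z is accepted with probability at most 2^-rounds, so `rounds` trades running time against confidence.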
Hope this helps!

Multinomial Generation of Degree n

I'm basically looking for a summation function that will compute multinomials given the number of variables and a degree.
Example
2 Variables; 2 Degrees:
x^2+y^2+x*y+x+y+1
Thanks.
See Knuth The Art of Computer Programming, Vol. 4, Fascicle 3 for a comprehensive answer.
Short answer: it's enough to generate all multinomial expressions in n variables with degree exactly d. Then, for your problem, you can either put together the answers with degrees ≤d, or add a dummy variable "1".
The problem of generating all expressions with degree exactly d is thus simply one of generating all ordered partitions (i.e., all nonnegative integer solutions to x1 + ... + xn = d), and this can be done with a simple backtracking algorithm. ("Depth-first search")
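A minimal backtracking sketch in Python of the idea above: enumerate the exponent tuples that sum to exactly d, then collect degrees d, d-1, ..., 0. The function names and the string format of the terms are illustrative choices.

```python
def exponent_vectors(n, d):
    """Yield every tuple of n >= 1 nonnegative integers that sums to exactly d."""
    def backtrack(prefix, remaining):
        if len(prefix) == n - 1:
            yield tuple(prefix) + (remaining,)  # last exponent is forced
            return
        for e in range(remaining, -1, -1):
            prefix.append(e)
            yield from backtrack(prefix, remaining - e)
            prefix.pop()
    yield from backtrack([], d)

def monomial(exps, names):
    """Render an exponent tuple such as (2, 1) as 'x^2*y'."""
    parts = []
    for name, e in zip(names, exps):
        if e == 1:
            parts.append(name)
        elif e > 1:
            parts.append(f"{name}^{e}")
    return "*".join(parts) or "1"

def polynomial_terms(names, max_degree):
    """All monomials in the given variables with total degree <= max_degree."""
    return [monomial(e, names)
            for d in range(max_degree, -1, -1)
            for e in exponent_vectors(len(names), d)]

terms = polynomial_terms(["x", "y"], 2)
```

For two variables and degree 2 this produces the six terms of the example in the question: x^2, x*y, y^2, x, y and 1.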
Given N variables, and a maximum degree of D, you have an array of D slots to fill with all possible combinations of variables.
[_, _, ..., _, _]
You are allowed to fill the slots with any of the N variables any number <= D times total. Since multiplication is commutative, it suffices to not care about ordering of variables. As such, this problem is reduced to generating (1) partitions of an integer and (2) subsets of a set.
I hope this is at least a start to your solution.
This also seems to be a Dynamic programming variant of the 0-1 Knapsack problem. Here we would be interested in all possible leaves of the decision tree.

higher order linear regression

I have the matrix system:
A x B = C
A is a by n and B is n by b. Both A and B are unknown but I have partial information about C (I have some values in it but not all) and n is picked to be small enough that the system is expected to be over constrained. It is not required that all rows in A or columns in B are over constrained.
I'm looking for something like least squares linear regression to find a best fit for this system (Note: I known there will not be a single unique solution but all I want is one of the best solutions)
To make a concrete example; all the a's and b's are unknown, all the c's are known, and the ?'s are ignored. I want to find a least squares solution only taking into account the know c's.
[ a11, a12 ] [ c11, c12, c13, c14, ? ]
[ a21, a22 ] [ b11, b12, b13, b14, b15] [ c21, c22, c23, c24, c25 ]
[ a31, a32 ] x [ b21, b22, b23, b24, b25] = C ~= [ c31, c32, c33, ?, c35 ]
[ a41, a42 ] [ ?, ?, c43, c44, c45 ]
[ a51, a52 ] [ c51, c52, c53, c54, c55 ]
Note that if B is trimmed to b11 and b21 only and the unknown row 4 chomped out, then this is almost a standard least squares linear regression problem.
This problem is ill-posed as described.
Let A, B, and C=5, be scalars. You are asking to solve
a*b=5
which has an infinite number of solutions.
One approach, on the information provided above, is to minimize
the function g defined as
g(A,B) = ||AB-C||^2 = trace((AB-C)*(AB-C))
using Newton's method or a quasi-Newton approach (BFGS).
(You can easily compute the gradient here).
M* is the transpose of M and multiplication is implicit.
(The norm is the frobenius norm... I removed the
underscore F as it was not displaying properly)
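For reference, the gradient mentioned above can be written out. This is the standard matrix-calculus result for the squared Frobenius norm. The 0/1 mask W, which marks the known entries of C, is notation introduced here and not part of the original formulation.

```latex
g(A,B) = \lVert AB - C \rVert_F^2, \qquad
\nabla_A g = 2\,(AB - C)\,B^{\mathsf T}, \qquad
\nabla_B g = 2\,A^{\mathsf T}(AB - C).

% Restricted to the known entries of C, with W_{ij} = 1 if c_{ij} is known and 0 otherwise:
g_W(A,B) = \lVert W \circ (AB - C) \rVert_F^2, \qquad
\nabla_A g_W = 2\,\bigl(W \circ (AB - C)\bigr)\,B^{\mathsf T}, \qquad
\nabla_B g_W = 2\,A^{\mathsf T}\bigl(W \circ (AB - C)\bigr).
```

Here \circ is the entrywise product, so the unknown entries of C simply contribute nothing to the residual.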
As this is an inherently nonlinear problem, standard linear
algebra approaches do not apply.
If you provide more information, I may be able to help more.
Some more questions: I think the issue here is that without
more information, there is no "best solution". We need to
determine a more concrete idea of what we are looking for.
One idea could be a "sparsest" solution. This area is
a hot area of research, with some of the best minds in the
world working here (See Terry Tao et al. work on Nuclear Norm)
This problem although tractable is still hard.
Unfortunately, I am not yet able to comment, so I will add my comments here.
As said below, LM is a great approach to solving this, but it is just one approach
along the lines of the Newton-type approaches to either
the optimization problem or the nonlinear solving problem.
Here is an idea, using the example you gave above: Lets define
two new vectors, V and U each with 21 elements (exactly the same number of defined
elements in C).
V is precisely the known elements of C, column ordered, so (in matlab notation)
V = [C11; C21; C31; C51; C12; .... ; C55]
U is a vector which is a column ordering of the product AB, LEAVING OUT THE
ELEMENTS CORRESPONDING TO '?' in matrix C. Collecting all the variables into x
we have
x = [a11, a21, .. a52, b11, b21 ..., b25].
f(x) = U (as defined above).
We can now try to solve f(x)=V with your favorite nonlinear least squares method.
As an aside, although a poster below recommended simulated annealing, I recommend
against it. There are some problems where it works, but it is a heuristic. When you have
powerful analytic methods such as Gauss-Newton or LM, I say use them. (in my own
experience that is)
A wild guess: A singular value decomposition might do the trick?
I have no idea on how to deal with your missing values, so I'm going to ignore that problem.
There are no unique solutions. To find a best solution you need some sort of a metric to judge them by. I'm going to suppose you want to use a least squares metric, i.e. the best guess values of A and B are those that minimize sum of the numbers [C_ij-(A B)_ij]^2.
One thing you didn't mention is how to determine the value you are going to use for n. In short, we can come up with 'good' solutions if 1 <= n <= b. This is because 1 <= rank(C) <= b, where rank(C) is the dimension of the column space of C. Note that this is assuming a >= b. To be more correct we would write 1 <= rank(C) <= min(a,b).
Now, supposing that you have chosen n such that 1 <= n <= b. You are going to minimize the residual sum of squares if you choose the columns of A such that span(A) = span(first n left singular vectors of C), i.e. the eigenvectors of C C' belonging to its n largest eigenvalues. If you don't have any other good reasons, just choose the columns of A to be the first n left singular vectors of C. Once you have chosen A, you can get the values of B in the usual linear regression way, i.e. B = (A'A)^(-1) A' C.
You have a couple of options. The Levenberg-Marquardt algorithm is generally recognized as the best LS method. A free implementation is available here. However, if the calculation is fast and you have a decent number of parameters, I would strongly suggest a Monte Carlo method such as simulated annealing.
You start with some set of parameters in the answer, and then you increase one of them by a random percentage up to a maximum. You then calculate the fitness function for your system. Now, here's the trick. You don't throw away the bad answers. You accept them with a Boltzmann probability distribution.
P = exp(-(x-x0)/T)
where T is a temperature parameter and x-x0 is the current fitness value minus the previous. After x number of iterations, you decrease T by a fixed amount (this is called the cooling schedule). You then repeat this process for another random parameter. As T decreases, fewer poor solutions are chosen, and eventually the procedure becomes a "greedy search" only accepting the solutions that improve the fit. If your system has many free parameters (> 10 or so), this is really the only way to go where you will have any chance of getting to a global minimum. This fitting method takes about 20 minutes to write in code, and a couple of hours to tweak. Hope this helps.
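A minimal sketch of this procedure in Python, with some assumptions of my own: the perturbation is an additive random step rather than a random percentage, the cooling is geometric (T is multiplied by a constant) rather than reduced by a fixed amount so that T never reaches zero, and the best point seen so far is remembered. All names and default values are illustrative.

```python
import math
import random

def anneal(fitness, x0, step=0.5, t0=1.0, cooling=0.95,
           sweeps=200, moves_per_t=20, seed=0):
    """Minimise fitness(x) over a list of floats by simulated annealing."""
    rng = random.Random(seed)
    x, fx = list(x0), fitness(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(sweeps):
        for _ in range(moves_per_t):
            cand = list(x)
            k = rng.randrange(len(cand))         # pick one random parameter
            cand[k] += rng.uniform(-step, step)  # perturb it
            fc = fitness(cand)
            # Always accept improvements; accept worse answers with
            # Boltzmann probability exp(-(fc - fx) / T).
            if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
                x, fx = cand, fc
                if fx < fbest:
                    best, fbest = list(x), fx
        t *= cooling  # cooling schedule (geometric here, an assumption)
    return best, fbest

def bowl(p):
    """Toy fitness function with its minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

best, fbest = anneal(bowl, [0.0, 0.0])
```

For the matrix problem above, x would hold all the entries of A and B, and the fitness would be the sum of squared residuals over the known entries of C.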
FYI, Wolfram has a nice discussion of this in the context of the traveling salesman problem, and I've been using it very successfully to solve some very difficult global minimization problems. It is slower than LM methods, but much better in most difficult/relatively large cases.
Based on the realization that cutting B to a single column and then removing the rows with unknowns converts this to very near a known problem, one approach would be to:
1. seed A with random values.
2. solve for each column of B independently.
3. rework the problem to allow solving for each row of A given the B values from step 2.
4. repeat at step 2 until things settle out.
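The four steps above can be sketched in plain Python as follows. This is only an illustration under my own assumptions: unknown entries of C are stored as None, each small least squares problem is solved through its normal equations (which are assumed to be non-singular), and the test matrix is made up, using the same pattern of unknowns as the example in the question. It makes no claim about stability or about reaching a global minimum.

```python
import random

def solve(M, rhs):
    """Solve the small square system M z = rhs (Gaussian elimination, partial pivoting)."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]  # fails if the system is singular
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z

def fit_vector(rows, targets):
    """Least squares for one vector z: minimise sum_k (rows[k] . z - targets[k])^2."""
    n = len(rows[0])
    M = [[sum(r[p] * r[q] for r in rows) for q in range(n)] for p in range(n)]
    rhs = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(n)]
    return solve(M, rhs)

def residual(A, B, C):
    """Sum of squared errors over the known entries of C only."""
    total = 0.0
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            if c is not None:
                pred = sum(A[i][k] * B[k][j] for k in range(len(B)))
                total += (pred - c) ** 2
    return total

def alternating_least_squares(C, n, sweeps=50, seed=0):
    rng = random.Random(seed)
    a, b = len(C), len(C[0])
    A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(a)]  # step 1
    B = [[0.0] * b for _ in range(n)]
    history = []
    for _ in range(sweeps):                                          # step 4
        for j in range(b):                                           # step 2
            known = [i for i in range(a) if C[i][j] is not None]
            col = fit_vector([A[i] for i in known], [C[i][j] for i in known])
            for k in range(n):
                B[k][j] = col[k]
        for i in range(a):                                           # step 3
            known = [j for j in range(b) if C[i][j] is not None]
            A[i] = fit_vector([[B[k][j] for k in range(n)] for j in known],
                              [C[i][j] for j in known])
        history.append(residual(A, B, C))
    return A, B, history

# Made-up data: an exactly rank-2 matrix with four entries hidden (None).
C = [[1, 2, 0, 1, None],
     [0, 1, 2, 1, 1],
     [1, 3, 2, None, 4],
     [None, None, 2, 3, 7],
     [1, 4, 4, 3, 5]]
A_fit, B_fit, history = alternating_least_squares(C, n=2)
```

Each half-step solves its own least squares problem exactly, so the residual over the known entries cannot increase from one sweep to the next; `history` records it after every sweep so you can watch whether things settle out.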
I have no clue if that is even stable.

Is a given set of group elements a set of coset representatives?

I am afraid the question is a bit technical, but I hope someone might have stumbled into a similar subject, or give me a pointer of some kind.
If G is a group (in the sense of algebraic structure), and if g1, ..., gn are elements of G, is there an algorithm (or a function in some dedicated program, like GAP) to determine whether there is a subgroup of G such that those elements form a set of representatives for the cosets of the subgroup? (We may assume that G is a permutation group, and probably even the full symmetric group.)
(There are of course several algorithms to find the cosets of a given subgroup, like the Todd-Coxeter algorithm; this is a kind of inverse question.)
Thanks,
Daniele
The only solution I can come up with is naive. Basically if you have elements x1,...,xn, you would use GAP's LowIndexSubgroupsFpGroup to enumerate all subgroups with index n (discarding those with index < n). Then you would go through each such group, generate the cosets, and check that each coset contains one of the elements.
This is all I could think of. I would be very interested if you came up with a better approach.
What you're trying to determine is if there is a subgroup H of G such that {g1, ..., gn} is a transversal of the cosets of H, i.e. a set of representatives of the partitioning of G by the cosets of H.
First, by Lagrange's theorem, |G| = |G:H| * |H|, where |G:H| = |G|/|H| is the index of the subgroup H of G. If {g1, ..., gn} is indeed a transversal, then |G:H| = |{g1, ..., gn}| = n, so the first test in your algorithm should be whether n divides |G|.
Moreover, since gi and gj are in the same right coset if and only if gi*gj^-1 is in H, you can then check subgroups with index n to see if they avoid every gi*gj^-1 with i != j. Also, note that (gi*gj^-1)(gj*gk^-1) = gi*gk^-1, so you can choose any pairing of the gi's.
This should be sufficient if n is small compared to |G|.
Another approach is to start with H being the trivial group and add elements of the set H* = {h in G : h^k != gi*gj^-1, for all i, j, k; i != j} to the generators of H until you can't add any more (i.e. until it's no longer a subgroup). H is then a maximal subgroup of G such that H is a subset of H*. If you can get all such H (and have them be large enough) then the subgroup you're looking for must be one of them.
This approach would work better for larger n.
Either way a non-exponential-time approach isn't obvious.
EDIT: I've just found a discussion of this very topic here: http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2009_April_18#Is_a_given_set_of_group_elements_a_set_of_coset_representatives.3F
A slightly less brute approach would be to enumerate all subgroups of index n, as Il-Bhima suggested, and then for each subgroup, check each xi * xj^-1 to see if it is contained in the subgroup.
The elements x1, ..., xn will be representatives for a subgroup of index n if and only if EVERY product
xi * xj^-1 where (i != j)
is NOT in the subgroup.
This type of check seems both simpler than generating all cosets, and computationally faster.
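The check for a single candidate subgroup can be sketched in Python. This is only an illustration: permutations are stored as tuples (p[i] is the image of i), the subgroup is passed in as an explicit list of its elements, and enumerating the candidate subgroups of index n is still left to a system such as GAP. The function names are made up.

```python
def compose(p, q):
    """Product p*q of two permutations as maps: (p*q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def is_right_transversal(elements, subgroup, group_order):
    """True if `elements` has exactly one representative of each right coset H*x."""
    H = set(subgroup)
    n = len(elements)
    # Lagrange: n must divide |G|, and H must have index n in G.
    if group_order % n != 0 or len(H) * n != group_order:
        return False
    for i, x in enumerate(elements):
        for j, y in enumerate(elements):
            # x and y lie in the same right coset iff x * y^-1 is in H.
            if i != j and compose(x, inverse(y)) in H:
                return False
    return True

# Example in S3 (order 6) with H = {identity, swap of 0 and 1}, which has index 3.
e = (0, 1, 2)
swap01 = (1, 0, 2)
c = (1, 2, 0)    # a 3-cycle
c2 = (2, 0, 1)   # its inverse
H = [e, swap01]

ok = is_right_transversal([e, c, c2], H, 6)
not_ok = is_right_transversal([e, swap01, c], H, 6)
```

In this example the first set passes, while the second fails because e and the swap lie in the same coset of H. The pairwise loop needs n*(n-1) membership tests per subgroup, with no cosets generated.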
