higher order linear regression - math

I have the matrix system:
A x B = C
A is a by n and B is n by b. Both A and B are unknown but I have partial information about C (I have some values in it but not all) and n is picked to be small enough that the system is expected to be over constrained. It is not required that all rows in A or columns in B are over constrained.
I'm looking for something like least squares linear regression to find a best fit for this system (Note: I known there will not be a single unique solution but all I want is one of the best solutions)
To make a concrete example; all the a's and b's are unknown, all the c's are known, and the ?'s are ignored. I want to find a least squares solution only taking into account the know c's.
[ a11, a12 ] [ c11, c12, c13, c14, ? ]
[ a21, a22 ] [ b11, b12, b13, b14, b15] [ c21, c22, c23, c24, c25 ]
[ a31, a32 ] x [ b21, b22, b23, b24, b25] = C ~= [ c31, c32, c33, ?, c35 ]
[ a41, a42 ] [ ?, ?, c43, c44, c45 ]
[ a51, a52 ] [ c51, c52, c53, c54, c55 ]
Note that if B is trimmed to b11 and b21 only and the unknown row 4 chomped out, then this is almost a standard least squares linear regression problem.

This problem is illposed as described.
Let A, B, and C=5, be scalars. You are asking to solve
a*b=5
which has an infinite number of solutions.
One approach, on the information provided above, is to minimize
the function g defined as
g(A,B) = ||AB-C||^2 = trace((AB-C)*(AB-C))^2
using Newtons method or a quasi-secant approach (BFGS).
(You can easily compute the gradient here).
M* is the transpose of M and multiplication is implicit.
(The norm is the frobenius norm... I removed the
underscore F as it was not displaying properly)
As this is an inherently nonlinear problem, standard linear
algebra approaches do not apply.
If you provide more information, I may be able to help more.
Some more questions: I think the issue is here is that without
more information, there is no "best solution". We need to
determine a more concrete idea of what we are looking for.
One idea, could be a "sparsest" solution. This area is
a hot area of research, with some of the best minds in the
world working here (See Terry Tao et al. work on Nuclear Norm)
This problem although tractable is still hard.
Unfortunately, I am not yet able to comment, so I will add my comments here.
As said below, LM is a great approach to solving this and is just one approach.
along the lines of the Newton type approaches to either
the optimization problem or the nonlinear solving problem.
Here is an idea, using the example you gave above: Lets define
two new vectors, V and U each with 21 elements (exactly the same number of defined
elements in C).
V is precisely the known elements of C, column ordered, so (in matlab notation)
V = [C11; C21; C31; C51; C12; .... ; C55]
U is a vector which is a column ordering of the product AB, LEAVING OUT THE
ELEMENTS CORRESPONDING TO '?' in matrix C. Collecting all the variables into x
we have
x = [a11, a21, .. a52, b11, b21 ..., b25].
f(x) = U (as defined above).
We can now try to solve f(x)=V with your favorite nonlinear least squares method.
As an aside, although a poster below recommended simulated annealing, I recommend
against it. THere are some problems it works, but it is a heuristic. When you have
powerful analytic methods such as Gauss-Newton or LM, I say use them. (in my own
experience that is)

A wild guess: A singular value decomposition might do the trick?

I have no idea on how to deal with your missing values, so I'm going to ignore that problem.
There are no unique solutions. To find a best solution you need some sort of a metric to judge them by. I'm going to suppose you want to use a least squares metric, i.e. the best guess values of A and B are those that minimize sum of the numbers [C_ij-(A B)_ij]^2.
One thing you didn't mention is how to determine the value you are going to use for n. In short, we can come up with 'good' solutions if 1 <= n <= b. This is because 1 <= rank(span(C)) <= b. Where rank(span(C)) = the dimension of the column space of C. Note that this is assuming a >= b. To be more correct we would write 1 <= rank(span(C)) <= min(a,b).
Now, supposing that you have chosen n such that 1 <= n <= b. You are going to minimize the residual sum of squares if you chose the columns of A such that span(A) = span(First n eigen vectors of C). If you don't have any other good reasons, just choose the columns of A to be to first n eigen vectors of C. Once you have chosen A, you can get the values of B in the usual linear regression way. I.e. B = (A'A)^(-1)A' C

You have a couple of options. The Levenberg-Marquadt algorithm is generally recognized as the best LS method. A free implementation is available at here. However, if the calculation is fast and you have a decent number of parameters, I would strongly suggest a Monte Carlo method such as simulated annealing.
You start with some set of parameters in the answer, and then you increase one of them by a random percentage up to a maximum. You then calculate the fitness function for your system. Now, here's the trick. You don't throw away the bad answers. You accept them with a Boltzmann probability distribution.
P = exp(-(x-x0)/T)
where T is a temperature parameter and x-x0 is the current fitness value minus the previous. After x number of iterations, you decrease T by a fixed amount (this is called the cooling schedule). You then repeat this process for another random parameter. As T decreases, fewer poor solutions are chosen, and eventually the procedure becomes a "greedy search" only accepting the solutions that improve the fit. If your system has many free parameters (> 10 or so), this is really the only way to go where you will have any chance of getting to a global minimum. This fitting method takes about 20 minutes to write in code, and a couple of hours to tweak. Hope this helps.
FYI, Wolfram has a nice discussion of this in the context of the traveling salesman problem, and I've been using it very successfully to solve some very difficult global minimization problems. It is slower than LM methods, but much better in most difficult/relatively large cases.

Based on the realization that cutting B to a single column and them removing row with unknowns converts this to very near a known problem, One approach would be to:
seed A with random values.
solve for each column of B independently.
rework the problem to allow solving for each row of A given the B values from step 2.
repeat at step 2 until things settle out.
I have no clue if that is even stable.

Related

Questions about SVD, Singular Value Decomposition

I am not a mathematician, so I need to understand what SVD does and WHY more than how it works exactly from the math perspective. (I understand at least what is the decomposition though).
This guy on youtube gave the only human explanation of SVD saying, that the U matrix maps "user to concept correlation" Sigma matrix defines the strength of each concept, and V maps "movie to concept correlation" given that initial matrix M has users in the rows, and movie (ratings) in the columns.
He also mentioned two concept specifically "sci fi" and "romance" movies. See the picture below.
My questions are:
How SVD knows the number of concepts. He as human mentioned two - sci fi, and romance, but in reality in resulting matrices are 3 concepts. (for example matrix U - that one with blue titles - has 3 columns not 2).
How SVD knows what is the concept after all. I mean, what If i shuffle the columns randomly how SVD then knows what is sci fi, what is romance. I mean, I suppose there is no rule, group the concepts together in the column order. What if scifi movie is the first and last one? and not first 3 columns in the initial matrix M?
What is the practical usage of either U, Sigma or V matrices? (Except that you can multiply them to get the initial matrix M)
Is there also any other possible human explanation of SVD than the guy up provided, or it is the only one possible function? Matrices of correlations.
As was pointed out in the comments you may well get better explanations elsewhere. However since the question is still open, here is my tuppence worth.
Throughout I'll suppose that A is mxn where m>=n, ie that A has more rows than columns.
First of all there are many forms of the SVD, differing in the sizes of the matrices. They all share the fundamental properties that
A = U*S*V'
S is diagonal
U and V have orthogonal columns (ie U'*U = I, V'*V = I)
Perhaps the most useful from a theoretical point of view is the 'full fat' svd where we have that U is mxm, S is mxn and V is nxn. However this has rather a lot of elements that don't really contribute to A. For example S being diagonal we can write
S = ( S1 ) (where S1 is nxn )
( 0 )
If we divide up U into
U = ( U1 U2) (where U1 is mxn and U2 is (mx(m-n)))
Then its straightforward to calculate that
U*S = U1*S1
and so we can throw away the last m-n columns of U and the last m-n rows of S, and still recover A.
Moreover some of the diagonal elements of S1 may be 0; suppose in fact that p<n of them are non zero. Then we can write
S1 = ( S2 0)
( 0 0)
And arguing as above for U and analogously for V' we can in fact throw away all but the first p columns of U and all of S but S2, and all but the first p rows of V, and still recover A.
This latter is the form of SVD ('thin') in your question:
U is mxp
S is pxp
V' is pxm
where p is the number of non-zero singular values of A. This is my answer to your 1.
By convention the elements of S decrease as you move down the diagonal. To achieve this the routine that calculates the svd in effect works with a version of A with shuffled columns. This shuffling is undone by incorporating the shuffle in the U and V' output. This is my answer to your 2: however you shuffle A, it will be in effect shuffled again to ensure that the singular values decrease down the diagonal.
I struggle to answer 3, because I suspect that our ideas of 'practical' are rather different.
One thing that I think practical is to find simpler approximations to A. The reconstruction of A can be written
A = Sum{ 1<=i<=p | U[i]*S[i]*V[i]' }
where the S[i] are the diagonal elements of S, U[i] are the columns of U and V[i] those of V
We might want to use a simpler model for A, for example want to simplify it down to just one term. That is, we might wonder how much we would lose by using fewer 'concepts'. The 'thin' svd above has already done this in the sense that it has thrown away all the coluns that make no contribution to A. In an extreme case, we might wonder what we would get if we reduced to just one concept. This approximation is found by taking just the first term of the sum above. This extends to however many terms -- q say -- we want to allow: we just take the first q terms of the sum above.
I'm sorry, I can't answer 4.

lpsolve - unfeasible solution, but I have example of 1

I'm trying to solve this in LPSolve IDE:
/* Objective function */
min: x + y;
/* Variable bounds */
r_1: 2x = 2y;
r_2: x + y = 1.11 x y;
r_3: x >= 1;
r_4: y >= 1;
but the response I get is:
Model name: 'LPSolver' - run #1
Objective: Minimize(R0)
SUBMITTED
Model size: 4 constraints, 2 variables, 5 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
The model is INFEASIBLE
lp_solve unsuccessful after 2 iter and a last best value of 1e+030
How come this can happen when x=1.801801802 and y=1.801801802 are possible solutions here?
How To Find The Solution
Let's do some math.
Your problem is:
min x+y
s.t. 2x = 2y
x + y = 1.11 x y
x >= 1
y >= 1
The first constraint 2x = 2y can be simplified to x=y. We now substitute throughout the problem:
min 2*x
s.t. 2*x = 1.11 x^2
x >= 1
And rearrange:
min 2*x
s.t. 1.11 x^2-2*x=0
x >= 1
From geometry we know that 1.11 x^2-2*x makes an upward-opening parabola with a minimum less than zero. Therefore, there are exactly two points. These are given by the quadratic equation: 200/111 and 0.
Only one of these satisfies the second constraint: 200/111.
Why Can't I Find This Constraint With My Solver
The easy way out is to say it's because the x^2 term (x*y before the substitution is nonlinear). But it goes a little deeper than that. Nonlinear problems can be easy to solve as long as they are convex. A convex problem is one whose constraints form a single, contiguous space such that any line drawn between two points in the space stays within the boundaries of the space.
Your problem is not convex. The constraint 1.11 x^2-2*x=0 defines an infinite number of points. No two of these points can be connected by a straight line which stays in the space defined by the constraint because that space is curved. If the constraint were instead 1.11 x^2-2*x<=0 then the space would be convex because all points could be connected with straight lines that stay in its interior.
Nonconvex problems are part of a broader class of problems called NP-Hard. This means that there is not (and perhaps cannot) be any easy way of solving the problem. We have to be smart.
Solvers that can handle mixed-integer programming (MIP/MILP) can solve many non-convex problems efficiently, as can other techniques such as genetic algorithms. But, beneath the hood, these techniques all rely on glorified guess-and-check.
So your solver fails because the problem is nonconvex and your solver is neither smart enough to use MIP to guess-and-check its way to a solution nor smart enough to use the quadratic equation.
How Then Can I Solve The Problem?
In this particular instance, we are able to use mathematics to quickly find a solution because, although the problem is nonconvex, it is part of a class of special cases. Deep thinking by mathematicians has given us a simple way of handling this class.
But consider a few generalizations of the problem:
(a) a x^3+b x^2+c x+d=0
(b) a x^4+b x^3+c x^2+d x+e =0
(c) a x^5+b x^4+c x^3+d x^2+e x+f=0
(a) has three potential solutions which must be checked (exact solutions are tricky), (b) has four (trickier), and (c) has five. The formulas for (a) and (b) are much more complex than the quadratic formula and mathematicians have shown that there is no formula for (c) that can be expressed using "elementary operations". Instead, we have to resort to glorified guess-and-check.
So the techniques we used to solve your problem don't generalize very well. This is what it means to live in the realm of the nonconvex and NP-hard, and it's a good reason to fund research in mathematics, computer science, and related fields.

Simple Orthographic Structure from Motion using R -- Determining Metric Constraints

I would like to build a simple structure from motion program according to Tomasi and Kanade [1992]. The article can be found below:
https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/TomasiC_Shape%20and%20motion%20from%20image%20streams%20under%20orthography.pdf
This method seems elegant and simple, however, I am having trouble calculating the metric constraints outlined in equation 16 of the above reference.
I am using R and have outlined my work thus far below:
Given a set of images
I want to track the corners of the three cabinet doors and the one picture (black points on images). First we read in the points as a matrix w where
Ultimately, we want to factorize w into a rotation matrix R and shape matrix S that describe the 3 dimensional points. I will spare as many details as I can but a complete description of the maths can be gleaned from the Tomasi and Kanade [1992] paper.
I supply w below:
w.vector=c(0.2076,0.1369,0.1918,0.1862,0.1741,0.1434,0.176,0.1723,0.2047,0.233,0.3593,0.3668,0.3744,0.3593,0.3876,0.3574,0.3639,0.3062,0.3295,0.3267,0.3128,0.2811,0.2979,0.2876,0.2782,0.2876,0.3838,0.3819,0.3819,0.3649,0.3913,0.3555,0.3593,0.2997,0.3202,0.3137,0.31,0.2718,0.2895,0.2867,0.825,0.7703,0.742,0.7251,0.7232,0.7138,0.7345,0.6911,0.1937,0.1248,0.1723,0.1741,0.1657,0.1313,0.162,0.1657,0.8834,0.8118,0.7552,0.727,0.7364,0.7232,0.7288,0.6892,0.4309,0.3798,0.4021,0.3965,0.3844,0.3546,0.3695,0.3583,0.314,0.3065,0.3989,0.3876,0.3857,0.3781,0.3989,0.3593,0.5184,0.4849,0.5147,0.5193,0.5109,0.4812,0.4979,0.4849,0.3536,0.3517,0.4121,0.3951,0.3951,0.3781,0.397,0.348,0.5175,0.484,0.5091,0.5147,0.5128,0.4784,0.4905,0.4821,0.7722,0.7326,0.7326,0.7232,0.7232,0.7119,0.7402,0.7006,0.4281,0.3779,0.3918,0.3863,0.3825,0.3472,0.3611,0.3537,0.8043,0.7628,0.7458,0.7288,0.727,0.7213,0.7364,0.6949,0.5789,0.5491,0.5761,0.5817,0.5733,0.5444,0.5537,0.5379,0.3649,0.3536,0.4177,0.3951,0.3857,0.3819,0.397,0.3461,0.697,0.671,0.6821,0.6821,0.6719,0.6412,0.6468,0.6235,0.3744,0.3649,0.4159,0.3819,0.3781,0.3612,0.3763,0.314,0.7008,0.6691,0.6794,0.6812,0.6747,0.6393,0.6412,0.6235,0.7571,0.7345,0.7439,0.7496,0.7402,0.742,0.7647,0.7213,0.5817,0.5463,0.5696,0.5779,0.5761,0.5398,0.551,0.5398,0.7665,0.7326,0.7439,0.7345,0.7288,0.727,0.7515,0.7062,0.8301,0.818,0.8571,0.8878,0.8766,0.8561,0.858,0.8394,0.4121,0.3876,0.4347,0.397,0.38,0.3631,0.3668,0.2971,0.912,0.8962,0.9185,0.939,0.9259,0.898,0.8887,0.8571,0.3989,0.3781,0.4215,0.3725,0.3612,0.3461,0.3423,0.2782,0.9092,0.8952,0.9176,0.9399,0.925,0.8971,0.8887,0.8571,0.4743,0.4536,0.4894,0.4517,0.446,0.4328,0.4385,0.3706,0.8273,0.8171,0.8571,0.8878,0.8766,0.8543,0.8561,0.8394,0.4743,0.4554,0.4969,0.4668,0.4536,0.4404,0.4536,0.3857)
w=matrix(w.vector,ncol=16,nrow=16,byrow=FALSE)
Then create registered measurement matrix wm according to equation 2 as
by
wm = w - rowMeans(w)
We can decompose wm into a '2FxP' matrix o1 a diagonal 'PxP' matrix e and 'PxP' matrix o2 by using a singular value decomposition.
svdwm <- svd(wm)
o1 <- svdwm$u
e <- diag(svdwm$d)
o2 <- t(svdwm$v) ## dont forget the transpose!
However, because of noise, we only pay attention to the first 3 columns of o1, first 3 values of e and the first 3 rows of o2 by:
o1p <- svdwm$u[,1:3]
ep <- diag(svdwm$d[1:3])
o2p <- t(svdwm$v)[1:3,] ## dont forget the transpose!
Now we can solve for our rhat and shat in equation (14)
by
rhat <- o1p%*%ep^(1/2)
shat <- ep^(1/2) %*% o2p
However, these results are not unique and we still need to solve for R and S by equation (15)
by using the metric constraints of equation (16)
Now I need to find Q. I believe there are two potential methods but am unclear how to employ either.
Method 1 involves solving for B where B=Q%*%solve(Q) then using Cholesky decomposition to find Q. Method 1 appears to be the common choice in literature, however, little detail is given as to how to actually solve the linear system. It is apparent that B is a '3x3' symmetric matrix of 6 unknowns. However, given the metric constraints (equations 16), I don't know how to solve for 6 unknowns given 3 equations. Am I forgetting a property of symmetric matrices?
Method II involves using non-linear methods to estimate Q and is less commonly used in structure from motion literature.
Can anyone offer some advice as to how to go about solving this problem? Thanks in advance and let me know if I need to be more clear in my question.
can be written as .
can be written as .
can be written as .
so our equations are:
So the first equation can be written as:
which is equivalent to
To keep it short we define now:
(I know the spacings are terrably small, but yes, this is a Vector...)
So for all equations in all different Frames f, we can write one big equation:
(sorry for the ugly formulas...)
Now you just need to solve the -Matrix using Cholesky decomposition or whatever...

How to find which subset of bitfields xor to another bitfield?

I have a somewhat math-oriented problem. I have a bunch of bitfields and would like to calculate what subset of them to xor together to achieve a certain other bitfield, or if there isn't a way to do it discover that no such subset exists.
I'd like to do this using a free library, rather than original code, and I'd strongly prefer something with Python bindings (using Python's built-in math libraries would be acceptable as well, but I want to port this to multiple languages eventually). Also it would be good to not take the memory hit of having to expand each bit to its own byte.
Some further clarification: I only need a single solution. My matrices are the opposite of sparse. I'm very interested in keeping the runtime to an absolute minimum, so using algorithmically fancy methods for inverting matrices is strongly preferred. Also, it's very important that the specific given bitfield be the one outputted, so a technique which just finds a subset which xor to 0 doesn't quite cut it.
And I'm generally aware of gaussian elimination. I'm trying to avoid doing this from scratch!
cross-posted to mathoverflow, because it isn't clear what the right place for this question is - https://mathoverflow.net/questions/41036/how-to-find-which-subset-of-bitfields-xor-to-another-bitfield
Mathematically speaking, XOR of two bits can be treated as addition in F_2 field.
You want to solve a set of equations in a F_2 field. For four bitfiels with bits (a_0, a_1, ... a_n), (b_0, b_1, ..., b_n), (c_0, c_1, ..., c_n), (r_0, r_1, ..., r_n), you get equations:
x * a_0 + y * b_0 + z * c_0 = r_0
x * a_1 + y * b_1 + z * c_1 = r_1
...
x * a_n + y * b_n + z * c_n = r_n
(where you look for x, y, z).
You could program this as a simple integer linear problem with glpk, probably lp_solve (but I don't remember if it will fit). These might work very slowly though, as they are trying to solve much more general problem.
After googling for a while, it seems that this page might be a good start looking for code. From descriptions it seems that Dixon and LinBox could be a good fit.
Anyway, I think asking at mathoverflow might give you more precise answers. If you do, please link your question here.
Update: Sagemath uses M4RI for solving this problem. This makes it (for me) a very good recommendation.
For small instances that easily fit in memory, this is just solving a linear system over F_2, so try mod-2 Gaussian elimination. For very large sparse instances, like those that occur in factoring (sieve) algorithms, look up the Wiedemann algorithm.
It's possible to have multiple subsets xor to the same value; do you care about finding all subsets?
A perhaps heavy-handed approach would be to filter the powerset of bitfields. In Haskell:
import Data.Bits
xorsTo :: Int -> [Int] -> [[Int]]
xorsTo target fields = filter xorsToTarget (powerset fields)
where xorsToTarget f = (foldl xor 0 f) == target
powerset [] = [[]]
powerset (x:xs) = powerset xs ++ map (x:) (powerset xs)
Not sure if there is a way to do this without generating the powerset. (In the worst case, it is possible for the solution to actually be the entire powerset).
expanding on liori's answer above we have a linear system of equations (in modulo 2):
a0, b0, c0 ...| r0
a1, b1, c1 ...| r1
... |
an, bn, cn ...| rn
Gaussian elimination can be used to solve the system. In modulo 2, the add row operation becomes an XOR operation. It is much simpler computationally to do this than to use a generic linear systems solver.
So, if a0 is zero we swap up a row that has a 1 in the a position. Then perform an XOR (using row 0) on any other row whos "a" bit is a 1. Then repeat using row 1 and column b, then row 2 col c, etc.
If you get a row of zeroes with a non-zero in the r column then the subset DNE.

Multinomial Generation of Degree n

I'm basically looking for a summation function that will compute multinomials given the number of variables and a degree.
Example
2 Variables; 2 Degrees:
x^2+y^2+x*y+x+y+1
Thanks.
See Knuth The Art of Computer Programming, Vol. 4, Fascicle 3 for a comprehensive answer.
Short answer: it's enough to generate all multinomial expressions in n variables with degree exactly d. Then, for your problem, you can either put together the answers with degrees ≤d, or add a dummy variable "1".
The problem of generating all expressions with degree exactly d is thus simply one of generating all ordered partitions (i.e., all nonnegative integer solutions to x1 + ... + xn = d), and this can be done with a simple backtracking algorithm. ("Depth-first search")
Given N variables, and a maximum degree of D, you have an array of D slots to fill with all possible combinations of variables.
[_, _, ..., _, _]
You are allowed to fill the slots with any of the N variables any number <= D times total. Since multiplication is commutative, it suffices to not care about ordering of variables. As such, this problem is reduced to generating (1) partitions of an integer and (2) subsets of a set.
I hope this is at least a start to your solution.
This also seems to be a Dynamic programming variant of the 0-1 Knapsack problem. Here we would be interested in all possible leaves of the decision tree.

Resources