R Solver for combinations - r

I have a data frame that contains decimal numbers in 5 columns; think of it as an Excel file. For every column I compute the sum. I provide an example below:
The problem is that I want to keep the optimal lines/observations such that:
the individual sums of col1, col2 and col3 are each as close as possible to 0,
AND simultaneously col4 and col5 each sum (again individually) as close as possible to a target (let's say 3).
The example was created with a random sample, so it does not contain negative numbers; I provide it only as an illustration.
I think problems of this kind are solved via a solver in R. I would like to find code that solves the above problem.

I have used the lpSolve package before, with its lp function, to solve linear programming problems with constraints. It should be easy to set up, as you don't appear to have many constraints. However, because lp takes a single objective function, you will probably need to define the objective as the combined deviation of the sums of your two branches (read: cols 1 to 3, and cols 4 to 5) from their targets.
There are potentially other methods, but this one should be fairly easy.
Hopefully this will help.
Regards

I think there are two obvious ways of doing this.
I'll write down the mathematical models that formalize my interpretation of your problem.
Quadratic Formulation
The first is a least squares approach. Let
x(i) = 1 if row i is selected
0 otherwise
Then we can write:
min sum(j, w(j)*d(j)^2)
d(j) = sum(i, a(i,j)*x(i)) - t(j)
d(j) : free variable (can be substituted out if wanted)
where
t(j) : target sum for column j (0 and 3 in your example)
w(j) : weight for column j (choose 1 if there is no good reason to use something else)
a(i,j) : your data matrix (or data frame)
This is called a MIQP model (Mixed Integer Quadratic Programming). There are MIQP solvers available for R.
Linear Formulation
Instead of least squares we can choose to use least absolute deviations. A high-level model can look like:
min sum(j, w(j)*|d(j)|)
d(j) = sum(i, a(i,j)*x(i)) - t(j)
d(j) : free variable (can be substituted out if wanted)
To make this a proper MIP (Mixed Integer Programming) model we need to make everything linear. One possible formulation is:
min sum(j, w(j)*y(j))
d(j) = sum(i, a(i,j)*x(i)) - t(j)
-y(j) <= d(j) <= y(j) (we can write this as two inequalities)
d(j) : free variable
y(j) >= 0 (positive variable)
There are many MIP solvers available for use with R.
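As a concrete illustration of this MIP formulation, here is a rough sketch using the lpSolve package; the data frame, targets and weights below are placeholder names, and d(j) has been substituted out so that only the binary x(i) and the nonnegative y(j) remain as variables:
library(lpSolve)

dat     <- as.matrix(your_data_frame)   # placeholder for your 5-column data frame
targets <- c(0, 0, 0, 3, 3)             # t(j): column targets from the question
w       <- rep(1, ncol(dat))            # w(j): column weights

n <- nrow(dat)                          # one binary x(i) per row
m <- ncol(dat)                          # one deviation variable y(j) per column

obj <- c(rep(0, n), w)                  # minimize sum(j, w(j)*y(j))

# sum(i, a(i,j)*x(i)) - y(j) <= t(j)   and   sum(i, a(i,j)*x(i)) + y(j) >= t(j)
con <- rbind(cbind(t(dat), -diag(m)),
             cbind(t(dat),  diag(m)))
dir <- c(rep("<=", m), rep(">=", m))
rhs <- c(targets, targets)

sol <- lp("min", obj, con, dir, rhs, binary.vec = 1:n)
selected_rows <- which(sol$solution[1:n] > 0.5)
colSums(dat[selected_rows, , drop = FALSE])   # column sums should land near the targets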

Related

How to implement != (not equal) in lpSolve r

Since lpSolve does not allow != as a constraint direction, what is an alternative way to get the same result?
I would like to maximize x1 + x2 with constraints: x1 <= 5 and
x2 != 5
and keep using the lpSolve R package.
I've tried using a combination of > and < in order to replicate the behaviour of !=; however, I do not obtain the result I expected.
f.obj <- c(1, 1)
f.con <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2, byrow = TRUE)
f.dir <- c("<=", "!=")
f.rhs <- c(5, 5)
lp("max", f.obj, f.con, f.dir, f.rhs)$solution
Since lpSolve does not support !=, I get the error message:
Error in lp("max",f.obj,f.con,f.dir,f.rhs): Unknown constraint direction found
EDIT
I would like to maximize x1 + x2 with constraints: x1 <= 5 and
x2 < 10 and x2 != 9.
So the solution would be 5 and 8.
You can't do that, even in theory, since the resulting constraint set is not closed. It is like trying to minimize x^2 over the set x > 0. For any proposed solution x0 in that set the solution x0/2 is better so there is no optimum.
I would just use x <= 5 as your constraint and if the constraint is not active (i.e. it turns out that x < 5) then you have found the solution; otherwise, there is no solution. If there is no solution you can try x <= 5 - eps for an arbitrarily chosen eps.
ADDED:
If what you intended was that the variables x1 and x2 are integer then
x < 10 and x != 9
is equivalent to
x <= 8
Note that lp has the all.int argument which defaults to FALSE.
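For example (a sketch, assuming both variables are required to be integer), the edited problem above becomes:
library(lpSolve)
f.obj <- c(1, 1)
f.con <- matrix(c(1, 0,
                  0, 1), nrow = 2, byrow = TRUE)
f.dir <- c("<=", "<=")
f.rhs <- c(5, 8)                    # x2 < 10 and x2 != 9 rewritten as x2 <= 8
lp("max", f.obj, f.con, f.dir, f.rhs, all.int = TRUE)$solution
# 5 8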
ADDED 2:
If you just want to find multiple feasible solutions, then, with opt denoting the value of the objective from the first solution, rerun after adding the constraint (assuming a maximization problem):
objective <= opt - eps
where eps is an arbitrary small constant.
Also note that if the vectors x1 and x2 are two optimal solutions to an LP then, since the constraint set is necessarily convex, any convex combination of those solutions is also feasible, and because the objective is linear all of those convex combinations must also be optimal. So if there is more than one optimum there are infinitely many optimal solutions, and you can't simply enumerate them.
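A small lpSolve sketch of the re-run described above, continuing the integer example (with integer variables, eps = 1 is a natural choice):
library(lpSolve)
f.obj <- c(1, 1)
f.con <- matrix(c(1, 0,
                  0, 1), nrow = 2, byrow = TRUE)
f.dir <- c("<=", "<=")
f.rhs <- c(5, 8)
first <- lp("max", f.obj, f.con, f.dir, f.rhs, all.int = TRUE)
# add the cut: objective <= opt - eps
second <- lp("max", f.obj,
             rbind(f.con, f.obj),
             c(f.dir, "<="),
             c(f.rhs, first$objval - 1),
             all.int = TRUE)
second$solution                     # a feasible point with the next-best objective value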
ADDED 3.
The feasible set of a linear program is a convex polyhedron (a polytope, if bounded), and if an optimal value exists it is attained at at least one vertex. If more than one vertex attains the optimal value, then all the points on the line connecting them are optimal as well. Although there are infinitely many optimal points in that case, there are only finitely many vertices, so you could enumerate them using the vertexenum package and then evaluate the objective at each one. If there is one vertex whose objective value is greater than all the others, that is the optimum. If there are several, we know that those, plus all convex combinations of them, are optimal. This might work if your problem is not too large.

differentiation in matlab

I need to find the acceleration of an object. The formula given in the text is a = d^2(L)/d(T)^2, where L = length and T = time.
I calculated this in MATLAB using the equation
a = (1/(T3-T1))*(((L3-L2)/(T3-T2))-((L2-L1)/(T2-T1)))
or
a = (v2-v1)/(T2-T1)
but I'm not getting the right answers. Can anybody tell me how to find a by any other method in MATLAB?
This has nothing to do with MATLAB; you are just trying to numerically differentiate a function twice. Depending on the behaviour of the higher (3rd, 4th) derivatives of the function, this will or will not yield reasonable results. You also have to expect an error of order |T3 - T1|^2 with a formula like the one you are using, assuming L is four times differentiable. Instead of using intervals of different sizes you may try symmetric approximations like
v(x) = (L(x+h) - L(x-h)) / (2h)
a(x) = (L(x-h) - 2 L(x) + L(x+h)) / h^2
From what I recall from my numerical math lectures, this is better suited for the numerical calculation of higher-order derivatives. You will still get an error of order
C |h|^2, with C = O( ||d^4 L / dt^4|| )
with ||.|| denoting the supremum norm of a function (that is, the fourth derivative of L needs to be bounded). If that holds, you can use the formula to calculate how small h has to be chosen in order to produce a result you are willing to accept. Note, though, that this is just the theoretical error, which is a consequence of an analysis of the Taylor approximation of L; see [1] or [2] -- this is where I got it from a moment ago -- or any other introductory book on numerical mathematics. You may get additional errors depending on the quality of the evaluation of L; also, if |L(x-h) - L(x)| is very small, numerical subtraction may be ill-conditioned.
[1] Knabner, Angermann; Numerik partieller Differentialgleichungen; Springer
[2] http://math.fullerton.edu/mathews/n2003/numericaldiffmod.html
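A quick numerical check of the symmetric formulas above, written in R here for consistency with the rest of this page (the MATLAB translation is line for line), using a test function whose second derivative is known:
L <- function(t) sin(t)                          # test signal; exact second derivative is -sin(t)
h  <- 1e-3
x0 <- 0.7
v <- (L(x0 + h) - L(x0 - h)) / (2 * h)           # symmetric first-derivative estimate
a <- (L(x0 + h) - 2 * L(x0) + L(x0 - h)) / h^2   # symmetric second-derivative estimate
c(estimate = a, exact = -sin(x0))                # the difference is of order h^2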

R, cointegration, multivariate, ca.jo(), johansen

I am new to R and cointegration, so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the western power system in Canada/US. The frequency is hourly (common in power) and cointegrated combinations can contain as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
            V1.l2       V2.l2       V3.l2
V1.l2   1.0000000   1.0000000   1.0000000
V2.l2  -0.2597057  -2.3888060  -0.4181294
V3.l2  -0.6443270  -0.6901678   0.5429844
As you can see, ca.jo tries to find linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1 (i.e. it is treated as the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as a dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them are cointegrated (i.e. V1 ~ V2 + V3), then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to try to take a stab at this one. EDIT: I noticed that I have just answered a 4-year-old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you need the lag length of the VAR as input to the procedure as well). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the comparability with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of at most r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
# We fit the data to a VAR to obtain the optimal VAR lag length; use the SC information criterion.
library(vars)   # for VAR()
library(urca)   # for ca.jo()
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# obtain the lag length of the VAR that best fits the data
lagLength <- max(2, varest$p)
# Perform the Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval
# If the test statistic for r <= 0 is greater than the corresponding critical value,
# then r <= 0 is rejected and we have at least one cointegrating vector.
# We use the 90% confidence level to make our decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1])
{
  # Return the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!
  return(res@V[1:ncol(yourData), which.max(res@lambda)])
}
This piece of code checks whether there is at least one cointegrating vector (r <= 0 is rejected) and then returns the vector with the strongest cointegrating properties, or in other words, the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you have your 3 different vectors. It is my understanding that the method just scales/normalises each vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, linear transformations of them, or combinations of the cointegrating vectors will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one, so I'm sharing it with you hoping it's the right solution.
By using the Johansen test you test for the rank (number of cointegration vectors), and it also returns the eigenvectors, and the alphas and betas that build said vectors.
In theory, if you reject r = 0 and accept r = 1 (the test statistic for r = 0 is above its critical value and the one for r = 1 is below it), you would search for the highest eigenvalue and build your vector from it. In this case, if the highest eigenvalue were the first, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
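For concreteness, a rough R sketch of that step (the data name is an assumed placeholder):
library(urca)
res   <- ca.jo(yourData, type = "trace", ecdet = "none", K = 2, spec = "longrun")
beta1 <- res@V[, which.max(res@lambda)]   # e.g. (1, -0.26, -0.64) in the output above
ect   <- as.matrix(yourData) %*% beta1    # cointegration residuals; stationary if the relation holds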
Again, I'm not 100% sure, but pretty sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.

Multinomial Generation of Degree n

I'm basically looking for a summation function that will compute multinomials given the number of variables and a degree.
Example
2 Variables; 2 Degrees:
x^2+y^2+x*y+x+y+1
Thanks.
See Knuth The Art of Computer Programming, Vol. 4, Fascicle 3 for a comprehensive answer.
Short answer: it's enough to generate all multinomial expressions in n variables with degree exactly d. Then, for your problem, you can either put together the answers with degrees ≤d, or add a dummy variable "1".
The problem of generating all expressions with degree exactly d is thus simply one of generating all ordered partitions (i.e., all nonnegative integer solutions to x1 + ... + xn = d), and this can be done with a simple backtracking algorithm. ("Depth-first search")
Given N variables, and a maximum degree of D, you have an array of D slots to fill with all possible combinations of variables.
[_, _, ..., _, _]
You are allowed to fill the slots with any of the N variables, at most D times in total. Since multiplication is commutative, the ordering of the variables does not matter. As such, this problem reduces to generating (1) partitions of an integer and (2) subsets of a set.
I hope this is at least a start to your solution.
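A small recursive R sketch of the exponent-vector enumeration described above (the function name and interface are mine):
# Enumerate all nonnegative integer exponent vectors (e1, ..., en) with e1 + ... + en = d,
# i.e. the monomials of degree exactly d in n variables.
monomials_of_degree <- function(n, d) {
  if (n == 1) return(matrix(d, nrow = 1))
  do.call(rbind, lapply(0:d, function(e) cbind(e, monomials_of_degree(n - 1, d - e))))
}
# All monomials of degree <= 2 in two variables: 6 exponent rows, matching x^2 + y^2 + x*y + x + y + 1
do.call(rbind, lapply(0:2, function(d) monomials_of_degree(2, d)))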
This also seems to be a Dynamic programming variant of the 0-1 Knapsack problem. Here we would be interested in all possible leaves of the decision tree.

higher order linear regression

I have the matrix system:
A x B = C
A is a by n and B is n by b. Both A and B are unknown, but I have partial information about C (I have some values in it but not all), and n is picked to be small enough that the system is expected to be over-constrained. It is not required that all rows in A or columns in B are over-constrained.
I'm looking for something like least squares linear regression to find a best fit for this system (note: I know there will not be a single unique solution; all I want is one of the best solutions).
To make a concrete example: all the a's and b's are unknown, all the c's are known, and the ?'s are ignored. I want to find a least squares solution taking into account only the known c's.
[ a11, a12 ]                                      [ c11, c12, c13, c14, ?   ]
[ a21, a22 ]   [ b11, b12, b13, b14, b15 ]        [ c21, c22, c23, c24, c25 ]
[ a31, a32 ] x [ b21, b22, b23, b24, b25 ] = C ~= [ c31, c32, c33, ?,   c35 ]
[ a41, a42 ]                                      [ ?,   ?,   c43, c44, c45 ]
[ a51, a52 ]                                      [ c51, c52, c53, c54, c55 ]
Note that if B is trimmed to b11 and b21 only and the unknown row 4 chomped out, then this is almost a standard least squares linear regression problem.
This problem is ill-posed as described.
Let A and B be scalars and let C = 5. You are asking to solve
a*b=5
which has an infinite number of solutions.
One approach, on the information provided above, is to minimize the function g defined as
g(A,B) = ||AB-C||_F^2 = trace((AB-C)*(AB-C))
using Newton's method or a quasi-Newton approach (BFGS). (You can easily compute the gradient here.)
Here M* is the transpose of M and multiplication is implicit; the norm is the Frobenius norm.
As this is an inherently nonlinear problem, standard linear algebra approaches do not apply.
If you provide more information, I may be able to help more.
Some more questions: I think the issue here is that without more information there is no "best solution". We need to determine a more concrete idea of what we are looking for.
One idea could be a "sparsest" solution. This is a hot area of research, with some of the best minds in the world working on it (see the work of Terry Tao et al. on the nuclear norm).
This problem, although tractable, is still hard.
Unfortunately, I am not yet able to comment, so I will add my comments here.
As said below, LM is a great approach to solving this and is just one approach along the lines of the Newton-type approaches to either the optimization problem or the nonlinear solving problem.
Here is an idea, using the example you gave above. Let's define two new vectors, V and U, each with 21 elements (exactly the same number as the defined elements in C).
V is precisely the known elements of C, column ordered, so (in MATLAB notation)
V = [C11; C21; C31; C51; C12; .... ; C55]
U is a column ordering of the product AB, leaving out the elements corresponding to '?' in matrix C. Collecting all the variables into x, we have
x = [a11, a21, .. a52, b11, b21 ..., b25]
and f(x) = U (as defined above).
We can now try to solve f(x) = V with your favorite nonlinear least squares method.
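A rough R sketch of this idea, minimizing the sum of squared residuals over only the known entries with optim's BFGS; all data and names below are made-up placeholders:
set.seed(1)
a <- 5; b <- 5; n <- 2
C <- matrix(rnorm(a * b), a, b)                # stand-in for the partially known C
known <- matrix(TRUE, a, b)                    # mask of observed entries
known[1, 5] <- known[3, 4] <- known[4, 1] <- known[4, 2] <- FALSE   # the '?' positions in the example
f <- function(x) {
  A <- matrix(x[1:(a * n)], a, n)
  B <- matrix(x[-(1:(a * n))], n, b)
  sum(((A %*% B - C)[known])^2)                # residuals only at the known positions
}
fit <- optim(rnorm(a * n + n * b), f, method = "BFGS")
fit$value                                      # residual sum of squares at the (local) optimum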
As an aside, although a poster below recommended simulated annealing, I recommend against it. There are some problems where it works, but it is a heuristic. When you have powerful analytic methods such as Gauss-Newton or LM, I say use them (in my own experience, that is).
A wild guess: A singular value decomposition might do the trick?
I have no idea on how to deal with your missing values, so I'm going to ignore that problem.
There are no unique solutions. To find a best solution you need some sort of metric to judge them by. I'm going to suppose you want to use a least squares metric, i.e. the best guess values of A and B are those that minimize the sum of the numbers [C_ij - (AB)_ij]^2.
One thing you didn't mention is how to determine the value you are going to use for n. In short, we can come up with 'good' solutions if 1 <= n <= b. This is because 1 <= rank(C) <= b, where rank(C) is the dimension of the column space of C. Note that this assumes a >= b; to be more correct we would write 1 <= rank(C) <= min(a, b).
Now, supposing that you have chosen n such that 1 <= n <= b, you will minimize the residual sum of squares if you choose the columns of A such that span(A) = span(first n left singular vectors of C), i.e. the first n eigenvectors of CC'. If you don't have any other good reason, just choose the columns of A to be the first n left singular vectors of C. Once you have chosen A, you can get the values of B in the usual linear regression way, i.e. B = (A'A)^(-1)A'C.
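A minimal R sketch of this approach (ignoring the missing-value issue, as noted above; all names are placeholders):
set.seed(1)
a <- 5; b <- 5; n <- 2
C  <- matrix(rnorm(a * b), a, b)               # stand-in for a fully known C
sv <- svd(C)
A  <- sv$u[, 1:n, drop = FALSE]                # first n left singular vectors of C
B  <- solve(t(A) %*% A, t(A) %*% C)            # B = (A'A)^(-1)A'C, the usual regression step
sum((A %*% B - C)^2)                           # residual of the best rank-n fit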
You have a couple of options. The Levenberg-Marquardt algorithm is generally recognized as the best LS method. A free implementation is available here. However, if the calculation is fast and you have a decent number of parameters, I would strongly suggest a Monte Carlo method such as simulated annealing.
You start with some set of parameters in the answer, and then you increase one of them by a random percentage up to a maximum. You then calculate the fitness function for your system. Now, here's the trick: you don't throw away the bad answers. You accept them with a Boltzmann probability distribution
P = exp(-(x-x0)/T)
where T is a temperature parameter and x-x0 is the current fitness value minus the previous one. After a number of iterations, you decrease T by a fixed amount (this is called the cooling schedule). You then repeat this process for another random parameter. As T decreases, fewer poor solutions are accepted, and eventually the procedure becomes a "greedy search", only accepting the solutions that improve the fit. If your system has many free parameters (> 10 or so), this is really the only way to go if you want any chance of reaching a global minimum. This fitting method takes about 20 minutes to write in code, and a couple of hours to tweak. Hope this helps.
FYI, Wolfram has a nice discussion of this in the context of the traveling salesman problem, and I've been using it very successfully to solve some very difficult global minimization problems. It is slower than LM methods, but much better in most difficult/relatively large cases.
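One off-the-shelf variant of this scheme is the "SANN" method of optim in base R, which uses a Metropolis/Boltzmann-style acceptance rule like the one above. A rough sketch on the same kind of masked least-squares objective (all names are placeholders):
set.seed(1)
a <- 5; b <- 5; n <- 2
C <- matrix(rnorm(a * b), a, b)                # stand-in for the partially known C
known <- matrix(TRUE, a, b)
known[1, 5] <- known[3, 4] <- known[4, 1] <- known[4, 2] <- FALSE   # the '?' positions in the example
rss <- function(x) {
  A <- matrix(x[1:(a * n)], a, n)
  B <- matrix(x[-(1:(a * n))], n, b)
  sum(((A %*% B - C)[known])^2)
}
fit <- optim(rnorm(a * n + n * b), rss, method = "SANN",
             control = list(maxit = 20000, temp = 10))   # temp and maxit play the roles of T and the cooling schedule
fit$value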
Based on the realization that trimming B to a single column and then removing the rows with unknowns converts this into very nearly a known problem, one approach would be the following (a rough sketch is given after this list):
1. Seed A with random values.
2. Solve for each column of B independently.
3. Rework the problem to solve for each row of A, given the B values from step 2.
4. Repeat from step 2 until things settle out.
I have no clue if that is even stable.
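A rough R sketch of this alternating scheme (all names are placeholders; as noted, no claim is made about its stability):
set.seed(1)
a <- 5; b <- 5; n <- 2
C <- matrix(rnorm(a * b), a, b)                # stand-in for the partially known C
known <- matrix(TRUE, a, b)
known[1, 5] <- known[3, 4] <- known[4, 1] <- known[4, 2] <- FALSE   # the '?' positions in the example
A <- matrix(rnorm(a * n), a, n)                # step 1: seed A with random values
B <- matrix(0, n, b)
for (iter in 1:50) {
  for (j in 1:b) {                             # step 2: least-squares fit of each column of B
    r <- which(known[, j])
    B[, j] <- qr.solve(A[r, , drop = FALSE], C[r, j])
  }
  for (i in 1:a) {                             # step 3: least-squares fit of each row of A
    s <- which(known[i, ])
    A[i, ] <- qr.solve(t(B[, s, drop = FALSE]), C[i, s])
  }
}
sum(((A %*% B - C)[known])^2)                  # step 4: iterate until this settles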
