SVM: Why is maximize margin == minimize Euclidean norm?

In the SVM optimization problem, we either want to maximize the margin 2/||w||,
or minimize the Euclidean norm of the weight vector w:
(1/2)*w^t*w
Can somebody explain to me why the Euclidean norm is the formula above, and not 1/sqrt(w^t*w)?
I assume the Euclidean norm is the Euclidean distance; how do we get to that formula?

The reason is that the following three are equivalent (under suitable mathematical conditions, which are usually met):
Maximize a quantity z.
Maximize f(z), where f is a strictly increasing function.
Minimize g(z), where g is a strictly decreasing function.
In your case, set z = ||w|| and apply the above the other way round: minimizing z = ||w|| is equivalent to minimizing f(z) = (1/2)*z^2 = (1/2)*||w||^2 (f is strictly increasing for z >= 0), and to maximizing g(z) = 2/z = 2/||w|| (g is strictly decreasing for z > 0).
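A tiny numeric illustration of the equivalence (candidate values of ||w|| made up here): the candidate that minimizes ||w|| also minimizes (1/2)*||w||^2 and maximizes 2/||w||.
norms <- c(0.5, 1.2, 2.0, 3.1)   # candidate values of ||w||
which.min(norms)                 # 1
which.min(0.5 * norms^2)         # 1: same candidate
which.max(2 / norms)             # 1: same candidate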

We have two boundaries, w·x + b = 1 and w·x + b = -1, and a middle one, w·x + b = 0. Now we want to figure out the distance between each of the two outer lines and the middle one; call this distance λ.
Consider a point z on w·x + b = 1. Moving from z a distance λ along the unit normal w/||w|| (toward the middle line) gives the point z - λ·w/||w|| on w·x + b = 0. So the distance between the middle line and w·x + b = 1 is the distance between z and z - λ·w/||w||.
Since this point lies on w·x + b = 0, we have:
w·(z - λ·w/||w||) + b = 0
w·z + b - λ·(w·w)/||w|| = 0
Since z is located on w·x + b = 1, we know that w·z + b equals 1, hence:
1 - λ·(w·w)/||w|| = 0
λ·(w·w)/||w|| = 1
λ·||w||^2/||w|| = 1
λ·||w|| = 1
λ = 1/||w||

Related

How do I minimize a linear least squares function in R?

I'm reading Deep Learning by Goodfellow et al. and am trying to implement gradient descent as shown in Section 4.5 Example: Linear Least Squares. This is page 92 in the hard copy of the book.
The algorithm can be viewed in detail at https://www.deeplearningbook.org/contents/numerical.html (the linear least squares example is on page 94 of the pdf); my R implementation follows.
I've tried implementing it in R, and the algorithm as implemented converges on a vector, but this vector does not seem to minimize the least squares function as required. Adding epsilon to the vector in question frequently produces a "minimum" less than the minimum output by my program.
options(digits = 15)
dim_square = 2 ### set dimension of square matrix
# Generate a random matrix A and a random vector b
set.seed(1234)
A = matrix(nrow = dim_square, ncol = dim_square, byrow = T, rlnorm(dim_square ^ 2)/10)
b = rep(rnorm(1), dim_square)
# having fixed A & b, select x randomly
x = rnorm(dim_square) # vector length of dim_square--supposed to be arbitrary
f = function(x, A, b){
  total_vector = A %*% x + b # residual vector whose squared norm we minimize
  total = 0.5 * sum(abs(total_vector) ^ 2) # (1/2) * squared L2 norm
  return(total)
}
f(x,A,b)
# how close do we want to get?
epsilon = 0.1
delta = 0.01
value = (t(A) %*% A) %*% x - t(A) %*% b # gradient of (1/2)*||A x - b||^2 with respect to x
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
steps = vector()
while(L2_norm > delta){
  x = x - epsilon * value
  value = (t(A) %*% A) %*% x - t(A) %*% b
  L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
  print(L2_norm)
}
minimum = f(x, A, b)
minimum
minimum_minus = f(x - 0.5*epsilon, A, b)
minimum_minus # less than the minimum found by gradient descent! Why?
Following page 94 of the pdf at https://www.deeplearningbook.org/contents/numerical.html,
I am trying to find the value of the vector x such that f(x) is minimized. However, as demonstrated by minimum and minimum_minus in my code, minimum is not the actual minimum, as it exceeds minimum_minus.
Any idea what the problem might be?
Original Problem
Finding the value of x such that the quantity ||Ax - b|| is minimized is equivalent to finding the value of x such that Ax - b = 0, i.e. x = (A^-1)*b. This is because the L2 norm is the Euclidean norm, more commonly known as the distance formula. By definition, distance cannot be negative, so its minimum is identically zero.
This algorithm, as implemented, actually comes quite close to estimating x. However, because of repeated subtraction and rounding, one quickly runs into the problem of underflow, resulting in massive oscillation, shown below:
[Plot: value of the L2 norm as a function of step number]
[Output: the above algorithm vs. the solve function in R]
Above we have the results of A %*% x followed by A %*% min_x, with x estimated by the implemented algorithm and min_x estimated by the solve function in R.
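For reference, a minimal sketch of that comparison (A, b, and x as in the question's code; min_x is the closed-form least-squares solution):
min_x <- solve(t(A) %*% A, t(A) %*% b)  # normal-equations solution
A %*% x                                 # x from the gradient descent above
A %*% min_x                             # closed-form comparison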
The problem of underflow, well known to those familiar with numerical analysis, is probably best left to the programmers of lower-level libraries, who are best equipped to tackle it.
To summarize, the algorithm appears to work as implemented. Important to note, however, is that not every function has a minimum (think of a straight line), and be aware that this algorithm can only find a local, as opposed to a global, minimum.

Constrained optimization of a vector

I have a (non-symmetric) probability matrix, and an observed vector of integer outcomes. I would like to find a vector that maximises the probability of the outcomes, given the transition matrix. Simply, I am trying to estimate a distribution of particles at sea given their ultimate distribution on land, and a matrix of probabilities of a particle released from a given point in the ocean ending up at a given point on the land.
The vector that I want to find is subject to the constraint that all components must be between 0 and 1, and that the sum of the components must equal 1. I am trying to figure out the best optimisation approach for the problem.
My transition matrix and data set are quite large, but I have created a smaller one here:
I used a simulated known at-sea distribution,
msim <- c(.3,.2,.1,.3,.1,0), and a simulated probability matrix (t) to come up with an estimated coastal data set (Datasim2), as follows:
t <- matrix(c(0,.1,.1,.1,.1,.2,0,.1,0,0,.3,0,0,0,0,.4,.1,.3,0,.1,0,.1,.4,0,0,0,.1,0,.1,.1),
            nrow=5, ncol=6, byrow=T)
rownames(t) <- c("C1","C2","C3","C4","C5") ### locations on land
colnames(t) <- c("S1","S2","S3","S4","S5","S6") ### locations at sea
Datasim <- as.numeric(round((t %*% msim)*500))
Datasim2 <- c(rep("C1",95), rep("C2",35), rep("C3",90), rep("C4",15), rep("C5",30))
M <- c(0.1,0.1,0.1,0.1,0.1,0.1) ## starting M
I started with a straightforward function as follows:
EstimateSource3 <- function(M, Data, T){
  EstEndProbsall <- M %*% T
  TotalLkhd <- rep(NA, times=dim(Data)[1])
  for (j in 1:dim(Data)[1]){
    ObsEstEndLkhd <- 0
    ObsEstEndLkhd <- 1 - EstEndProbsall[1,] ## likelihood of the particle NOT ending up at locations other than the location of interest
    IndexC <- which(colnames(EstEndProbsall)==Data$LocationCode[j], arr.ind=T) ## likelihood of ending up at the location of interest
    ObsEstEndLkhd[IndexC] <- EstEndProbsall[IndexC]
    # total likelihood
    TotalLkhd[j] <- sum(log(ObsEstEndLkhd))
  }
  SumTotalLkhd <- sum(TotalLkhd)
  return(SumTotalLkhd)
}
DistributionEstimate <- optim(par = M, fn = EstimateSource3, Data = Datasim2, T=t,
control = list(fnscale = -1, trace=5, maxit=500), lower = 0, upper = 1)
To constrain the sum to 1, I tried a few of the suggestions posted here: How to set parameters' sum to 1 in constrained optimization,
e.g. adding M <- M/sum(M) or SumTotalLkhd <- SumTotalLkhd - (10*pwr) to the body of the function, but neither yielded anything like msim, and in fact the second approach produced the error "L-BFGS-B needs finite values of 'fn'".
I thought perhaps the quadprog package might be of some help, but I don’t think I have a symmetric positive definite matrix…
Thanks in advance for your help!
What about this: let D = the distribution on land, M = the distribution at sea, and T the transition matrix. You know D and T, and you want to calculate M. You have
D' = M' T
hence D' T' = M' (T T')
and accordingly D' T' (T T')^(-1) = M'
Basically you solve it as you would a linear regression. (It seems SO does not support math notation: ' is transpose, ^(-1) is the ordinary matrix inverse.)
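A minimal R sketch of this, using the t and msim simulated in the question (note that t there is land x sea, i.e. the T' of this answer, so D = t M; and since t is 5x6 the system is underdetermined, so the pseudoinverse MASS::ginv, which gives the minimum-norm least-squares solution, stands in for the ordinary inverse):
library(MASS)
D <- t %*% msim                    # land distribution implied by the simulated msim
M_hat <- as.vector(ginv(t) %*% D)  # minimum-norm solution of D = t M
M_hat                              # compare with msim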
Alternatively, D may be counts of particles, and now you can ask questions like: what is the most likely distribution of particles at sea. That needs a different approach though.
Well, I have never done such models, but think along the following lines. Let M be of length 3 and D of length 2; T is hence 3x2. We know T, and we observe D_1 particles at location 1 and D_2 particles at location 2.
What is the probability that a given particle is observed at location 1? It is Pr(D = 1) = M_1 T_11 + M_2 T_21 + M_3 T_31. Analogously, Pr(D = 2) = M_1 T_12 + M_2 T_22 + M_3 T_32. Now you can easily write the log-likelihood of observing D_1 and D_2 particles at locations 1 and 2. The code might look like this:
loglik <- function(M) {
  if(M[1] < 0 | M[1] > 1)
    return(NA)
  if(M[2] < 0 | M[2] > 1)
    return(NA)
  M3 <- 1 - M[1] - M[2]
  if(M3 < 0 | M3 > 1)
    return(NA)
  D[1]*log(T[1,1]*M[1] + T[2,1]*M[2] + T[3,1]*M3) +
    D[2]*log(T[1,2]*M[1] + T[2,2]*M[2] + T[3,2]*M3)
}
T <- matrix(c(0.1,0.2,0.3,0.9,0.8,0.7), 3, 2)
D <- c(100,200)
library(maxLik)
m <- maxLik(loglik, start=c(0.4,0.4), method="BFGS")
summary(m)
I get the answer (0, 0.2, 0.8) when I estimate it, but the standard errors are very large.
As I said, I have never done this, so I don't know if it makes sense.

Solving mixed linear and differential systems of equations with R [duplicate]

I've got a bit of a weird set of conditions I need to fit a curve to. I've tried looking it up elsewhere but I'm not even sure I'm using the right lingo. Any help is much appreciated.
I'm trying to fit a polynomial curve to a set of four points. Three of the points are known, but the fourth one is a little tricky. I have the x value for the maximum y value, but I don't know what the maximum y value is. For example, let's say there are known points at (0,0), (1,1), and (4,0). The maximum y value is at x=3 so the fourth point is (3, ymax). How would I fit a 4th order polynomial curve to those conditions? Thanks in advance.
Actually it is possible, since you require that the y value at x=3 be a maximum. A degree 4 polynomial has 5 coefficients to be determined, and you have the following equations:
y(0) = 0
y(1) = 1
y(4) = 0
dy/dx(3) = 0 (first derivative at x=3 should be 0)
d2y/dx2(3) < 0 (2nd derivative at x=3 should be negative)
So, pick any negative value for d2y/dx2 at x=3 and solve the 5 linear equations, and you will get one degree 4 polynomial. Note that the y value at x=3 obtained this way is only a local maximum, not a global maximum.
Filling in the algebra from @fang's answer (a little elementary calculus, some algebra, and some linear algebra):
y = a+b*x+c*x^2+d*x^3+e*x^4
y(0) = 0 -> a=0
Set a=0 for the rest of the computations.
y(1) = 1 -> b+c+d+e = 1
y(4) = 0 -> 4*b+16*c+64*d+256*e=0
dy/dx(3) = 0 ->
b + 2*c*x + 3*d*x^2 + 4*e*x^3 = 0 at x=3 ->
b + 6*c + 27*d + 108*e = 0
d2y/dx2(3) < 0 ->
2*c + 6*d*x + 12*e*x^2 < 0 at x=3 ->
2*c + 18*d + 108*e < 0
Pick a negative value V for the last expression, say -1:
V <- -1
A <- matrix(c(1, 1, 1, 1,
4,16,64,256,
1, 6,27,108,
0, 2,18,108),
ncol=4,byrow=TRUE)
b <- c(1,0,0,V)
(p <- solve(A,b))
## [1] 2.6400000 -2.4200000 0.8933333 -0.1133333
x <- seq(-0.5,5,length=101)
m <- model.matrix(~poly(x,degree=4,raw=TRUE))
y <- m %*% c(0,p)
Plot results:
par(las=1,bty="l")
plot(x,y,type="l")
points(c(0,1,4),c(0,1,0))
abline(v=3,lty=2)
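As a quick sanity check (a sketch using the p computed above): the first derivative at x=3 should be ~0 and the second derivative should equal V.
dp <- p * (1:4)               # coefficients of dy/dx, for powers x^0..x^3
sum(dp * 3^(0:3))             # dy/dx at x=3: ~0
sum(p[2:4] * c(2, 18, 108))   # d2y/dx2 at x=3: equals V = -1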
Picking a larger-magnitude (more negative) value for V will make the solution more sharply peaked at x=3.

Calculate the length of a segment of a quadratic bezier

I use this algorithm to calculate the length of a quadratic bezier:
http://www.malczak.linuxpl.com/blog/quadratic-bezier-curve-length/
However, what I wish to do is calculate the length of the bezier from 0 to t where 0 < t < 1
Is there any way to modify the formula used in the link above to get the length of the first segment of a bezier curve?
Just to clarify, I'm not looking for the distance between q(0) and q(t) but the length of the arc that goes between these points.
(I don't wish to use adaptive subdivision to approximate the length)
Since I was sure a similar closed-form solution would exist for the variable-t case, I extended the solution given in the link.
Starting from the equation in the link,
L(t) = ∫ sqrt(A*s^2 + B*s + C) ds, integrated from 0 to t,
which we can write as
L(t) = sqrt(A) * ∫ sqrt((s + b)^2 + (c - b^2)) ds,
where b = B/(2A) and c = C/A.
Then, transforming u = s + b, we get
L(t) = sqrt(A) * ∫ sqrt(u^2 + k) du, integrated from b to t + b,
where k = c - b^2.
Now we can use the integral identity from the link,
∫ sqrt(u^2 + k) du = (1/2) * (u*sqrt(u^2 + k) + k*log|u + sqrt(u^2 + k)|) + const,
to obtain:
L(t) = (sqrt(A)/2) * ( u*sqrt(u^2 + k) - b*sqrt(b^2 + k) + k*log( |u + sqrt(u^2 + k)| / |b + sqrt(b^2 + k)| ) ), with u = t + b.
So, in summary, the required steps are:
Calculate A,B,C as in the original equation.
Calculate b = B/(2A) and c = C/A
Calculate u = t + b and k = c - b^2
Plug these values into the equation above.
[Edit by Spektre] I just managed to implement this in C++, so here is the code (working correctly, matching naively obtained arc lengths):
float x0,x1,x2,y0,y1,y2; // control points of Bezier curve
float get_l_analytic(float t) // get arclength from parameter t=<0,1>
{
float ax,ay,bx,by,A,B,C,b,c,u,k,L;
ax=x0-x1-x1+x2;
ay=y0-y1-y1+y2;
bx=x1+x1-x0-x0;
by=y1+y1-y0-y0;
A=4.0*((ax*ax)+(ay*ay));
B=4.0*((ax*bx)+(ay*by));
C= (bx*bx)+(by*by);
b=B/(2.0*A);
c=C/A;
u=t+b;
k=c-(b*b);
L=0.5*sqrt(A)*
(
(u*sqrt((u*u)+k))
-(b*sqrt((b*b)+k))
+(k*log(fabs((u+sqrt((u*u)+k))/(b+sqrt((b*b)+k)))))
);
return L;
}
There is still room for improvement, as some terms are computed more than once ...
While there may be a closed form expression, this is what I'd do:
Use De Casteljau's algorithm to split the bezier into the 0-to-t part and the rest, then use the algorithm from the link to calculate the length of the first part.
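A minimal sketch of that split for a quadratic bezier (hypothetical helper written here for illustration; p0, p1, p2 are length-2 control-point vectors):
split_quad_bezier <- function(p0, p1, p2, t) {
  q1 <- (1 - t) * p0 + t * p1     # first-level De Casteljau interpolation
  r1 <- (1 - t) * p1 + t * p2
  q2 <- (1 - t) * q1 + t * r1     # the point on the curve at parameter t
  list(p0 = p0, p1 = q1, p2 = q2) # control points of the 0-to-t segment
}
The full-curve length formula from the link can then be applied to the returned control points.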
You just have to evaluate the integral not between 0 and 1 but between 0 and t. You can use the symbolic toolbox of your choice to do that if you're not into the math. For instance:
http://integrals.wolfram.com/index.jsp?expr=Sqrt[a*x*x%2Bb*x%2Bc]&random=false
Evaluate the result for x = t and x = 0 and subtract them.
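For a numeric version of this (a sketch using base R's integrate, with A, B, C as computed for the curve, e.g. as in the C++ code above):
arc_length <- function(t, A, B, C)
  integrate(function(s) sqrt(A*s^2 + B*s + C), lower = 0, upper = t)$value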

Implementing additional constraints in R's nnls

I am using the R interface to the Lawson-Hanson NNLS implementation of an algorithm for non-negative linear least squares, which minimizes ||A x - b||^2 subject to the constraint that all elements of the vector x are >= 0. This works fine, but I would like to add further constraints. Of interest to me are:
Also minimize "energy" of x:
||A x - b||^2 + m*||x||^2
Minimize "energy in the x derivative"
||A x - b||^2 + m ||H x||^2, where H is the sum of identity and a matrix with -1 on the first off-diagonal
Most generally, minimize ||A x - b||^2 + m ||H x - f||^2.
Is there a way to coax nnls to do this by some clever restatement of problems 1-3 above? The reason I have hope for such a thing is a little throw-away comment in a paper by Whitall et al (sorry for the paywall) that claims that "fortunately, NNLS can be adopted from the original form above to accommodate something in problem 3".
I take it m is a scalar, right? Consider the simple case m=1; you can generalize for other values of m by letting H* = sqrt(m) H and f* = sqrt(m) f and using the solution method given here.
So now you're trying to minimise ||A x - b||^2 + ||H x - f||^2.
Let A* = [A' | H']' and let b* = [b' | f']' (i.e. stack A on top of H and b on top of f) and solve the original problem of
non-negative linear least squares on ||A* x - b*||^2 with the constraint that all elements of the vector x are >= 0.
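A minimal sketch of this stacking in R with the nnls package (toy A, b, and m made up here; H and f chosen as in problem 1, i.e. H = identity and f = 0):
library(nnls)
set.seed(1)
A <- matrix(rnorm(20), 10, 2); b <- rnorm(10)
m <- 0.5                        # regularization weight
H <- diag(2); f <- rep(0, 2)    # problem 1: H = identity, f = 0
Astar <- rbind(A, sqrt(m) * H)  # stack A on top of sqrt(m)*H
bstar <- c(b, sqrt(m) * f)      # stack b on top of sqrt(m)*f
fit <- nnls(Astar, bstar)       # minimizes ||A* x - b*||^2 subject to x >= 0
fit$x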
