Support Vector Machine Geometrical Intuition - math

Hi,
I am having great difficulty understanding why the equation of the support vector machine hyperplane has a 1 after the >=: w.x + b >= 1 (why this 1?). I know it could be something about the intercept on the y axis, but I cannot relate that to the support vectors and to their meaning for classification.
Can anyone please explain why the equation has that 1 (and -1)?
Thank you.

The 1 is just an algebraic simplification, which comes in handy in the later optimization.
First, notice that all three hyperplanes can be denoted as
w'x+b= 0
w'x+b=+A
w'x+b=-A
If we fixed the norm of the normal w, ||w||=1, then the above would have one solution with some A that depends on the data; let's call this solution v and c (the values of the optimal w and b respectively). But if we let w have any norm, then we can easily see that if we put
w'x+b= 0
w'x+b=+1
w'x+b=-1
then there is a unique w which satisfies these equations, and it is given by w=v/A, b=c/A, because
(v/A)'x+(c/A)= 0 (when v'x+c= 0) // for the middle hyperplane
(v/A)'x+(c/A)=+1 (when v'x+c=+A) // for the positive hyperplane
(v/A)'x+(c/A)=-1 (when v'x+c=-A) // for the negative hyperplane
In other words, we assume that these support vectors satisfy the equation w'x+b=+/-1 for later simplification, and we can do so because for any solution satisfying v'x+c=+/-A there is a solution of our equation (with a different norm of w).
So once we have these simplifications, our optimization problem reduces to the minimization of the norm ||w|| (maximization of the size of the margin, which can now be expressed as 2/||w||). If we stayed with the general equation with a (not fixed!) A value, then the maximization of the margin would have one more "dimension": we would have to search through w, b, A to find the triple which maximizes it (as the constraints would be of the form y(w'x+b)>=A). Now, we just search through w and b (and in the dual formulation, just through alpha, but that is a whole new story).
This step is not required. You can build an SVM without it, but it makes things simpler - Occam's razor.
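To make the rescaling concrete, here is a minimal NumPy sketch (the data points, v, c and A below are made up for illustration): it takes a solution v, c whose support vectors sit at v'x+c = +/-A, divides by A, and checks that they now sit at +/-1 with margin 2/||w||.

import numpy as np

# A hypothetical solution with ||v|| = 1 whose support vectors lie at v'x + c = +/-A.
v = np.array([1.0, 0.0])          # unit normal of the separating hyperplane
c = -0.5                          # offset
A = 2.0                           # value |v'x + c| attained by the closest points

x_pos = np.array([2.5, 3.0])      # support vector of the positive class: v'x + c = +A
x_neg = np.array([-1.5, -1.0])    # support vector of the negative class: v'x + c = -A

# Rescale to the canonical form w'x + b = +/-1.
w, b = v / A, c / A

print(w @ x_pos + b)              # -> +1.0
print(w @ x_neg + b)              # -> -1.0
print(2.0 / np.linalg.norm(w))    # margin width 2/||w|| (= 2A here, since ||v|| = 1)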

This boundary is called the "margin"; to maximize it you have to minimize ||w||.
The aim of an SVM is to find a hyperplane able to maximize the distance between the two groups.
However, the same geometric hyperplane has infinitely many parameterizations (see figure: rescale w and b, or slide the parallel planes along the perpendicular vector), so we need to fix the boundaries somehow: the +1 or -1 is a common convention to remove this ambiguity.
Formally, you have to optimize the distance r from the hyperplane to the closest points, and we set the normalization condition r ||w|| = 1.

Related

Why do we calculate the margin in SVM?

I'm learning SVM (Support Vector Machine): there are several points that remain ambiguous (linearly separable, primal case).
I know how to find the weight w and the hyperplane equation, but if we can deduce the support vectors from them, why do we calculate the margin? What do I need to calculate first, and in which case? (Sorry for the mixed questions, but I'm really lost with it.)
I saw in some examples that the margin is calculated in this manner:
1 / ||w||
while in others, this way :
2 / ||w||
So what is the difference between those two cases?
Thanks
The optimization objective of an SVM is to reduce ||w|| (by choosing w and b) in such a way that we get the maximum margin around the hyperplane.
Mathematically speaking,
it is a nonlinear optimization task, which is solved via the KKT (Karush-Kuhn-Tucker) conditions, using Lagrange multipliers.
The following video explains this in simple terms for the linearly separable case:
https://www.youtube.com/watch?v=1NxnPkZM9bc
How this is calculated is explained in more detail here, for both the linear and primal cases:
https://www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf
The margin between the separating hyperplane and the class boundaries of an SVM is an essential feature of this algorithm.
See, you have two hyperplanes: (1) w^tx+b>=1 if y=1, and (2) w^tx+b<=-1 if y=-1. This says that any vector with label y=1 must lie either on or behind hyperplane (1). The same applies to the vectors with label y=-1 and hyperplane (2).
Note: if those requirements can be fulfilled, it implicitly means the dataset is linearly separable. This makes sense, because otherwise no such margin can be constructed.
So, what an SVM tries to find is a decision boundary which is halfway between (1) and (2). Let's define this boundary as (3) w^tx+b=0. What you see here is that (1), (2) and (3) are parallel hyperplanes, because they share the same parameters w and b. The parameter w holds the direction of those planes. Recall that a vector always has a direction and a magnitude/length.
The question is now: how can one calculate hyperplane (3)? The equations (1) and (2) tell us that any vector with label y=1 which is closest to (3) lies exactly on hyperplane (1), hence (1) becomes w^tx+b=1 for such x. The same applies to the closest vectors with a negative label and (2). The vectors lying on these planes are called 'support vectors', and the decision boundary (3) depends only on them, because one can simply subtract (2) from (1) for the support vectors and get:
(w^tx_+ + b) - (w^tx_- + b) = 1 - (-1)  =>  w^t(x_+ - x_-) = 2
Note: x_+ and x_- are two different support vectors, one on each of the two planes.
Now, we want the direction of w while ignoring its length, in order to get the shortest distance between (3) and the other planes. This distance is a perpendicular line segment from (3) to the others. To do so, one can divide by the length of w to get the unit vector perpendicular to (3), hence w^t(x_+ - x_-)/||w|| = 2/||w||. The left-hand side is the projection of x_+ - x_- onto the unit normal, which is exactly the distance between the two planes, so that distance is 2/||w||, and it must be maximized. (The 1/||w|| you sometimes see is the distance from the decision boundary (3) to either of the planes (1) and (2), i.e. half of this.)
Edit:
As others state here, use Lagrange multipliers or the SMO algorithm to minimize the term
1/2 ||w||^2
s.t. y(w^tx+b)>=1
This is the convex form of the optimization problem for the primal SVM.
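To see these quantities on actual data, here is a small sketch using scikit-learn (my tooling choice, not mentioned by the posters above); a hard-margin linear SVM is approximated with a large C, and the margin 2/||w|| is read off the fitted coefficients:

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]

print("support vectors:", clf.support_vectors_)
print("y*(w'x+b) at the support vectors:",
      y[clf.support_] * (X[clf.support_] @ w + b))   # approximately +1
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))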

Flexagon Simulation

What is the best way to simulate a flexagon?
My best guess at a starting point is to represent the faces and edges, and simulate transformations based on where edges meet. I'm thinking that in the process of implementing a transformation, it will become apparent when folding in a given direction is physically impossible.
I'm going to try to figure this out by experimentation, but it definitely feels like the kind of problem where a gap in my facility with mathematics is holding me back.
Edit: To clarify, I'm interested in what sort of data structures I could use to represent a flexagon and how I can manipulate those data structures to simulate the folding of a flexagon.
If you write all of the invariants of the flexagon as a system of equations, small deviations around legal states may be written as a linear system. For instance, the stiffness of a piece of paper between (x1,y1) and (x2,y2) enforces
(x1 - x2)**2 + (y1 - y2)**2 - L**2 == 0
This can be softened to
chi2 = (x1 - x2)**2 + (y1 - y2)**2 - L**2 + other constraints...
Derivatives of chi2 with respect to x1, x2, y1, y2 yield linear equations. A system of linear equations is a matrix, and an eigenvalue/eigenvector decomposition of that matrix gives you linear combinations of the x1, x2, y1, y2 parameters that are easy or hard to bend. The eigenvectors are a basis set of possible directions, and each one's corresponding eigenvalue tells you how hard it is to bend in that direction. Larger eigenvalues are more constrained.
A problem with the above is that if there are any directions that are truly allowed, that is, the derivative of chi2 with respect to p is 0 (the original constraint is absolutely satisfied), then the matrix is singular and can't be inverted to get the eigensystem. If you only want to know what those absolutely allowed directions are, you can compute the null space of the matrix instead of its eigensystem. However, I suspect (never having played with a flexagon) that the "allowed" directions involve a little bit of bending, in which case chi2 is small but nonzero. Then you'd be looking for small but nonzero eigenvalues. Other degrees of freedom are allowed and uninteresting, such as translation or rotation of the whole object. To turn it into a pure eigensystem problem (no null space at all), add constraints to the system with arbitrarily small constants lambda:
chi2 += lambda_x * (x1 + x2)**2/4.0 + lambda_y * (y1 + y2)**2/4.0
You'll recognize them in your solution because they'll vary as you vary each lambda. (The example above gives a penalty lambda_x to translating in x and lambda_y to translating in y.)
In terms of implementation, you can use any linear algebra software to compute solutions and check for variation with the lambdas. I used Python to prototype a problem like this (detector alignment in high energy physics, in which the constraints are measurements like "this detector is 3 cm from that detector" and the chi2 was derived from the uncertainties "3 cm +- 0.1 cm") and then ported the solution to C++ (BLAS) for production. The Numpy library for Python had enough linear algebra (it's BLAS under the hood), though I also used the generic, non-linear minimizers in Scipy to debug the matrix solution. The hardest part is getting the indexes to line up right, which is necessary when casting it as a matrix and not when you give an objective function to a generic minimizer (because you use variable names instead). This is more of a Matlab or Mathematica problem, so if you're more comfortable with one of them, use it instead. This problem will require a lot of trial and error, so use the most interactive system possible (one with a good REPL or worksheet/notebook-style interface).
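As a minimal illustration of this recipe (the geometry is a made-up two-segment linkage rather than a real flexagon, and the lambda value is arbitrary), the sketch below builds chi2 for two stiff segments plus small translation penalties, takes its Hessian by finite differences, and reads the soft directions off the small eigenvalues:

import numpy as np

L = 1.0
lam = 1e-3   # small penalty pinning down overall translation (rotation is left free)

def chi2(p):
    # Soft constraint function for a 3-point, 2-segment linkage in the plane.
    x1, y1, x2, y2, x3, y3 = p
    seg12 = (x1 - x2)**2 + (y1 - y2)**2 - L**2     # stiffness of segment 1-2
    seg23 = (x2 - x3)**2 + (y2 - y3)**2 - L**2     # stiffness of segment 2-3
    tx = (x1 + x2 + x3) / 3.0
    ty = (y1 + y2 + y3) / 3.0
    return seg12**2 + seg23**2 + lam * (tx**2 + ty**2)

# A legal configuration (all hard constraints exactly satisfied): a straight chain.
p0 = np.array([0.0, 0.0, 1.0, 0.0, 2.0, 0.0])

# Numerical Hessian of chi2 at p0 via central finite differences.
n, h = len(p0), 1e-4
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        pa = p0.copy(); pa[i] += h; pa[j] += h
        pb = p0.copy(); pb[i] += h; pb[j] -= h
        pc = p0.copy(); pc[i] -= h; pc[j] += h
        pd = p0.copy(); pd[i] -= h; pd[j] -= h
        H[i, j] = (chi2(pa) - chi2(pb) - chi2(pc) + chi2(pd)) / (4 * h * h)

vals, vecs = np.linalg.eigh(H)
print(vals)   # near-zero eigenvalues = easy (allowed) motions, large ones = stiff directions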
It can also be helpful to draw a graph of the connections (graph-theory graph, not a plot), on which to label their constraints. For me, that was a necessary first step before writing out the equations.
It might also help to visualize the system by writing a set of functions that take parameter values (x1, etc.) and draw the figure with OpenGL (or another 3-D mesh renderer). This can show you if some constraint is being violated, because the mesh tiles would pass through each other. It can also help you identify the degrees of freedom represented by each eigenvector: vary the parameters by the linear combination represented by the eigenvector and you'll see if it's just translating/rotating or if it's doing some interesting twist or fold.

Solving a system of linear equations in a non-square matrix [closed]

I have a system of linear equations that make up an NxM matrix (i.e. Non-square) which I need to solve - or at least attempt to solve in order to show that there is no solution to the system. (more likely than not, there will be no solution)
As I understand it, if my matrix is not square (over- or under-determined), then no exact solution can be found - am I correct in thinking this? Is there a way to transform my matrix into a square matrix in order to calculate the determinant, apply Gaussian elimination, Cramer's rule, etc.?
It may be worth mentioning that the coefficients of my unknowns may be zero, so in certain, rare cases it would be possible to have a zero-column or zero-row.
Whether or not your matrix is square is not what determines the solution space. It is the rank of the matrix compared to the number of columns that determines that (see the rank-nullity theorem). In general you can have zero, one or an infinite number of solutions to a linear system of equations, depending on its rank and nullity relationship.
To answer your question, however, you can use Gaussian elimination to find the rank of the matrix and, if this indicates that solutions exist, find a particular solution x0 and the nullspace Null(A) of the matrix. Then, you can describe all your solutions as x = x0 + xn, where xn represents any element of Null(A). For example, if a matrix has full column rank, its nullspace is trivial (only the zero vector) and the linear system has at most one solution. If its rank is also equal to the number of rows, then you have one unique solution. If the nullspace is of dimension one, then your solution will be a line that passes through x0, any point on that line satisfying the linear equations.
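A quick way to carry this out in practice (a sketch in NumPy/SciPy, my tooling choice, not part of the answer above): compare the rank of A with the rank of the augmented matrix [A|b], take a particular solution from a least-squares solve, and get the nullspace basis from SciPy.

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])                  # 2x3, rank-deficient example
b = np.array([1.0, 2.0])

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))

if rank_A < rank_Ab:
    print("no solution: b is not in the column space of A")
else:
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)   # a particular solution
    N = null_space(A)                            # columns form a basis of Null(A)
    print("particular solution x0 =", x0)
    print("general solution: x0 + N @ t for any vector t of length", N.shape[1])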
Ok, first off: a non-square system of equations can have an exact solution
[ 1 0 0 ] [x]   [1]
[ 0 0 1 ] [y] = [1]
          [z]
clearly has a solution (actually, it has a 1-dimensional family of solutions: x=z=1, y arbitrary). Even if the system is overdetermined instead of underdetermined, it may still have a solution:
[ 1 0 ] [x]   [1]
[ 0 1 ] [y] = [1]
[ 1 1 ]       [2]
(x=y=1). You may want to start by looking at least squares solution methods, which find the exact solution if one exists, and "the best" approximate solution (in some sense) if one does not.
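For instance, NumPy's least-squares routine (my choice of tool, not specified by the answer) recovers the exact solution x=y=1 of the overdetermined example above, because one exists:

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 2.0])

x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)          # -> [1. 1.], the exact solution
print(residuals)  # -> ~0, since the system is consistent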
Take Ax = b, with A having n rows and m columns. We are not guaranteed to have one and only one solution, which in many cases is because we have more equations than unknowns (n bigger than m). This could be because of repeated measurements, which we actually want because we are cautious about the influence of noise.
If we observe that we cannot find a solution, that actually means there is no way to reach b while travelling within the column space spanned by A (as Ax only takes combinations of the columns).
We can, however, ask for the point in the space spanned by A that is nearest to b. How can we find such a point? Walking on a plane, the closest you can get to a point outside it is to walk until you are right below it; geometrically speaking, this is when your line of sight is perpendicular to the plane.
Now that is something we can formulate mathematically. A perpendicular vector reminds us of orthogonal projection, and that is what we are going to use. In the simplest case we project onto a single column a and compute a.T b; for the whole matrix we take A.T b.
For our equation, let us apply this transformation to both sides: A.T A x = A.T b.
The last step is to solve for x by taking the inverse of A.T A (which exists when A has full column rank):
x = (A.T A)^-1 * A.T b
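A quick numerical check of these normal equations (in NumPy, my tooling choice; in practice np.linalg.lstsq or a QR/SVD-based solver is preferred over forming A.T A explicitly, for conditioning reasons):

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 0.0, 3.0])                      # inconsistent: no exact solution

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least squares

print(x_normal, x_lstsq)    # both give the same least-squares x
print(A @ x_normal)         # the projection of b onto the column space of A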
The least squares recommendation is a very good one.
I'll add that you can try a singular value decomposition (SVD) that will give you the best answer possible and provide information about the null space for free.
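A sketch of that SVD route (again NumPy, my choice of tool): the pseudoinverse built from the SVD gives the minimum-norm least-squares solution, and the right singular vectors belonging to (near-)zero singular values span the null space.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
b = np.array([1.0, 2.0])

U, s, Vt = np.linalg.svd(A, full_matrices=True)

tol = max(A.shape) * np.finfo(float).eps * s.max()
r = int((s > tol).sum())                     # numerical rank

# Minimum-norm least-squares solution via the pseudoinverse.
x = Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])

# Right singular vectors beyond the rank span the null space of A.
null_basis = Vt[r:].T

print("x =", x)
print("null space basis:")
print(null_basis)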

How to plot implicit equations

What is the usual method or algorithm used to plot implicit equations of 2 variables?
I am talking about equations such as,
sin(x*y)*y = 20
x*x - y*y = 1
Etc.
Does anyone know how Maple or Matlab do this? My target language is C#.
Many thanks!
One way to do this is to sample the function on a regular, 2D grid. Then you can run an algorithm like marching squares on the resulting 2D grid to draw iso-contours.
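As a quick way to see the idea before porting it to C#, here is a sketch in Python (my choice of prototyping language) that samples f(x, y) = x*x - y*y - 1 on a grid and draws its zero contour with matplotlib, which does the marching-squares-style contouring internally:

import numpy as np
import matplotlib.pyplot as plt

# Sample f(x, y) = x*x - y*y - 1 on a regular 2D grid.
xs = np.linspace(-3, 3, 400)
ys = np.linspace(-3, 3, 400)
X, Y = np.meshgrid(xs, ys)
F = X * X - Y * Y - 1

# The implicit curve x*x - y*y = 1 is the zero level set of F.
plt.contour(X, Y, F, levels=[0])
plt.gca().set_aspect("equal")
plt.show()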
In a related question, someone also linked to the gnuplot source code. It's fairly complex, but might be worth going through. You can find it here: http://www.gnuplot.info/
Iterate the value of x across the range you want to plot. For each fixed value of x, solve the equation numerically using a method such as interval bisection or the Newton-Raphson method (for which you can calculate the derivative using implicit differentiation, or perhaps differentiate numerically). This will give you the corresponding value of y for a given x. In most cases, you won't need too many iterations to get a very precise result, and it's very efficient anyway.
Note that you will need to transform the equation into the form f(y) = 0 for each fixed x (i.e. move everything onto one side), though this is always trivial. The nice thing about this method is that it works just as well the other way round (i.e. taking a fixed range of y and computing x per value).
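A minimal sketch of this column-by-column root finding (Python with SciPy's brentq bisection-style solver, my choice; the bracketing interval [0, 10] is an assumption for this example, and in general you would scan each column for sign changes):

import numpy as np
from scipy.optimize import brentq

# Implicit curve x*x - y*y = 1, rewritten as f(y) = 0 for each fixed x.
def f(y, x):
    return x * x - y * y - 1

points = []
for x in np.linspace(1.0, 3.0, 200):     # region where the upper branch exists
    g = lambda y: f(y, x)
    if g(0.0) * g(10.0) < 0:             # sign change -> a root is bracketed
        y = brentq(g, 0.0, 10.0)         # upper branch; mirror for y < 0
        points.append((x, y))

print(points[:3])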
There are multiple methods. The easiest algorithm I could find is described here:
https://homepages.warwick.ac.uk/staff/David.Tall/pdfs/dot1986b-implicit-fns.pdf - it describes the approach Noldorin outlined above.
The most complex one, which seems to be the one that can actually handle a lot of special cases, is described here:
https://academic.oup.com/comjnl/article/33/5/402/480353
I think that in MATLAB you give an array as input for x.
Then, for every x, it calculates y.
Then it draws a line from (x0, y0) to (x1, y1),
then a line from (x1, y1) to (x2, y2),
and so on.

Approximating nonparametric cubic Bezier

What is the best way to approximate a cubic Bezier curve? Ideally I would want a function y(x) which would give the exact y value for any given x, but this would involve solving a cubic equation for every x value, which is too slow for my needs, and there may be numerical stability issues as well with this approach.
Would this be a good solution?
Just solve the cubic.
If you're talking about Bezier plane curves, where x(t) and y(t) are cubic polynomials, then y(x) might be undefined or have multiple values. An extreme degenerate case would be the line x= 1.0, which can be expressed as a cubic Bezier (control point 2 is the same as end point 1; control point 3 is the same as end point 4). In that case, y(x) has no solutions for x != 1.0, and infinite solutions for x == 1.0.
A method of recursive subdivision will work, but I would expect it to be much slower than just solving the cubic. (Unless you're working with some sort of embedded processor with unusually poor floating-point capacity.)
You should have no trouble finding code that solves a cubic and that has already been thoroughly tested and debugged. If you implement your own solution using recursive subdivision, you won't have that advantage.
Finally, yes, there may be numerical stability problems, like when the point you want is near a tangent, but a subdivision method won't make those go away. It will just make them less obvious.
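A sketch of the "just solve the cubic" route (Python/NumPy rather than your target language, and numpy.roots stands in for whatever tested cubic solver you adopt): expand the Bezier's x(t) into power-basis coefficients, solve x(t) = x_query for t, keep the real root in [0, 1], and evaluate y(t).

import numpy as np

def bezier_y_of_x(ctrl_x, ctrl_y, x_query, eps=1e-9):
    # y(x) on a cubic Bezier, assuming exactly one real root t in [0, 1].
    x0, x1, x2, x3 = ctrl_x
    # Power-basis coefficients of x(t) - x_query, highest degree first.
    coeffs = [-x0 + 3*x1 - 3*x2 + x3,
              3*x0 - 6*x1 + 3*x2,
              -3*x0 + 3*x1,
              x0 - x_query]
    roots = np.roots(coeffs)
    # Keep the (essentially) real roots that lie in the curve's parameter range.
    ts = [r.real for r in roots if abs(r.imag) < eps and -eps <= r.real <= 1 + eps]
    t = min(max(ts[0], 0.0), 1.0)
    y0, y1, y2, y3 = ctrl_y
    s = 1.0 - t
    return s**3*y0 + 3*s*s*t*y1 + 3*s*t*t*y2 + t**3*y3

# Made-up control points with monotone x, so y(x) is single-valued:
print(bezier_y_of_x((0.0, 1.0, 2.0, 3.0), (0.0, 2.0, 2.0, 0.0), 1.5))   # -> 1.5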
EDIT: responding to your comment, but I need more than 300 characters.
I'm only dealing with bezier curves where y(x) has only one (real) root. Regarding numerical stability, using the formula from http://en.wikipedia.org/wiki/Cubic_equation#Summary, it would appear that there might be problems if u is very small. – jtxx000
The wackypedia article is math with no code. I suspect you can find some cookbook code that's more ready-to-use somewhere, maybe Numerical Recipes or the ACM Collected Algorithms.
To your specific question, and using the same notation as the article, u is only zero or near zero when p is also zero or near zero. They're related by the equation:
u^6 + q u^3 == p^3 / 27
Near zero, you can use the approximation:
q u^3 == p^3 / 27
or p / (3u) == cube root of q
So the computation of x from u should contain something like:
(fabs(u) >= somesmallvalue) ? (p / u / 3.0) : cuberoot (q)
How "near" zero is near? Depends on how much accuracy you need. You could spend some quality time with Maple or Matlab looking at how much error is introduced for what magnitudes of u. Of course, only you know how much accuracy you need.
The article gives 3 formulas for u for the 3 roots of the cubic. Given the three u values, you can get the 3 corresponding x values. The 3 values for u and x are all complex numbers with an imaginary component. If you're sure that there has to be only one real solution, then you expect one of the roots to have a zero imaginary component, and the other two to be complex conjugates. It looks like you have to compute all three and then pick the real one. (Note that a complex u can correspond to a real x!) However, there's another numerical stability problem there: floating-point arithmetic being what it is, the imaginary component of the real solution will not be exactly zero, and the imaginary components of the non-real roots can be arbitrarily close to zero. So numeric round-off can result in you picking the wrong root. It would be helpful if there's some sanity check from your application that you could apply there.
If you do pick the right root, one or more iterations of Newton-Raphson can improve its accuracy a lot.
Yes, the de Casteljau algorithm would work for you. However, I don't know whether it will be faster than solving the cubic equation by Cardano's method.
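For reference, here is a minimal de Casteljau subdivision sketch (Python; finding y(x) by repeatedly keeping the half whose x-range contains the query is my framing of the recursive-subdivision idea discussed above, and it assumes x(t) is monotone over the segment):

def de_casteljau_split(pts, t=0.5):
    # Split a cubic Bezier (list of 4 (x, y) control points) at parameter t.
    lerp = lambda a, b: (a[0] + t*(b[0] - a[0]), a[1] + t*(b[1] - a[1]))
    p01, p12, p23 = lerp(pts[0], pts[1]), lerp(pts[1], pts[2]), lerp(pts[2], pts[3])
    p012, p123 = lerp(p01, p12), lerp(p12, p23)
    p0123 = lerp(p012, p123)
    return [pts[0], p01, p012, p0123], [p0123, p123, p23, pts[3]]

def y_at_x(pts, x, depth=40):
    # Approximate y(x) by bisection: keep the half whose control-point x-range contains x.
    for _ in range(depth):
        left, right = de_casteljau_split(pts)
        xs = [p[0] for p in left]
        pts = left if min(xs) <= x <= max(xs) else right
    return pts[0][1]

print(y_at_x([(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 0.0)], 1.5))   # ~1.5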

Resources