Julia: use convolution to compute second derivative?

I am trying to compute the second derivative of a function represented as an array. I am using
conv(u,[1,-2,1])[2:length(u) - 1]/dx
as an approximation of the second derivative, but it is not accurate enough for my purpose. Does anyone know another method of computing second derivatives?
Thanks
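One thing worth checking first: the three-point stencil [1, -2, 1] approximates the second derivative only after dividing by dx^2, not dx. Beyond that, a wider stencil gives higher-order accuracy. Below is a minimal sketch in Python (not Julia), assuming a uniformly spaced grid. The function names are illustrative, and the five-point weights [-1, 16, -30, 16, -1]/(12 dx^2) are the standard fourth-order central difference. Both functions return values at interior points only; the boundaries would need one-sided stencils.

```python
import math

def d2_three_point(u, dx):
    """Second-order central difference at interior points 1..n-2."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / dx**2
            for i in range(1, len(u) - 1)]

def d2_five_point(u, dx):
    """Fourth-order central difference at interior points 2..n-3."""
    return [(-u[i - 2] + 16.0 * u[i - 1] - 30.0 * u[i]
             + 16.0 * u[i + 1] - u[i + 2]) / (12.0 * dx**2)
            for i in range(2, len(u) - 2)]

def max_error_on_sine(n=101):
    """Compare both stencils against the exact u'' = -sin(x) on [0, 2*pi]."""
    dx = 2.0 * math.pi / (n - 1)
    xs = [i * dx for i in range(n)]
    u = [math.sin(x) for x in xs]
    exact = [-math.sin(x) for x in xs]
    err3 = max(abs(a - b) for a, b in zip(d2_three_point(u, dx), exact[1:-1]))
    err5 = max(abs(a - b) for a, b in zip(d2_five_point(u, dx), exact[2:-2]))
    return err3, err5
```

Calling max_error_on_sine() returns the maximum error of each stencil on a sine wave, which should show the five-point version being noticeably more accurate on smooth data.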

Related

(Differentiable Image Sampling) Custom Integer Sampling Kernel, Spatial Transformer Network

I was looking through the Spatial Transformer Network paper, and I am trying to implement a custom grid_sample function (inheriting the autograd.Function class) in PyTorch for the Integer Sampling Kernel.
While defining the backward function, I have come across the following conundrum.
Given that integer sampling works as follows:
I think that the gradients w.r.t. the input map and the transformed grid (x_i^s, y_i^s) should be as follows:
Gradient w.r.t. input map:
Gradient w.r.t transformed grid (x_i^s):
Gradient w.r.t transformed grid (y_i^s):
as the derivative of the Kronecker delta function is zero (I'm unsure about this!! - HELP)
Derivative of the Kronecker delta?
Thus I am reaching the conclusion that the gradient w.r.t. the input should be a tensor of the same size as the input, filled with ones where a pixel was sampled and zeros where it wasn't, and the gradient w.r.t. the transformed grid should be a tensor full of zeros.
However, if the gradient of the transformed grid is 0, then by the chain rule no gradient information will be passed on to the layers before the integer sampler. Therefore I think the derivative with respect to the grid should be something else. Could anybody point out what I'm doing wrong?
Many thanks in advance!
For future reference, and for those who have similar questions to the one I posted:
I've emailed Dr Jaderberg (one of the authors of 'Spatial Transformer Networks') about this question and he has confirmed "that the gradient wrt the coordinates for integer sampling is 0.". So I wasn't doing anything wrong, and it was right all along!
He was very kind in his response. He explained that integer sampling was mentioned in the paper to introduce the bilinear sampling scheme, and gave some insight into how one might implement integer sampling if I really wanted to:
"you could think about using some numerical differentiation techniques (e.g. look at difference of x to its neighbours). This would assume smoothness in the image wrt coordinates."
So with great thanks to Dr Jaderberg, I'm happy to close this question.
I guess thinking about how I'd use numerical methods to implement the integer kernel for the sampling function is another challenge for myself, but until then I guess the bilinear sampler is my friend! :)
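For anyone who wants to experiment, here is a rough sketch of that idea in plain Python on a 1-D "image". It is only one possible reading of the suggestion quoted above, not the paper's method: the forward pass rounds the coordinate with floor(x + 0.5), the exact gradient w.r.t. the coordinate is 0, and a surrogate gradient is taken from the difference between the neighbouring pixels. The function names, the border clamping, and the choice of a central difference are all assumptions made for illustration.

```python
import math

def integer_sample(U, x):
    """Forward pass: pick the pixel nearest to coordinate x (1-D image)."""
    m = int(math.floor(x + 0.5))
    m = min(max(m, 0), len(U) - 1)  # clamp to the image border (an assumption)
    return U[m], m

def integer_sample_backward(U, m, grad_out=1.0):
    """Backward pass for the sample taken at index m."""
    # Gradient w.r.t. the input map: nonzero only at the sampled pixel.
    grad_U = [0.0] * len(U)
    grad_U[m] = grad_out
    # Exact gradient w.r.t. the coordinate is zero (piecewise-constant output).
    grad_x_exact = 0.0
    # Surrogate: finite difference across the neighbours of the sampled pixel,
    # which assumes the image is smooth w.r.t. the coordinate.
    lo, hi = max(m - 1, 0), min(m + 1, len(U) - 1)
    slope = (U[hi] - U[lo]) / (hi - lo) if hi > lo else 0.0
    grad_x_surrogate = grad_out * slope
    return grad_U, grad_x_exact, grad_x_surrogate
```

In a real PyTorch autograd.Function the same three quantities would be returned from backward as tensors, with the surrogate used in place of the exact zero.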

Why do we use log probability in deep learning?

I got curious while reading the paper 'Sequence to Sequence Learning with Neural Networks'.
In fact, not only this paper but also many other papers use log probabilities. Is there a reason for that?
Please check the attached photo.
Two reasons:
Theoretical - The probability of two independent events A and B occurring together is P(A)*P(B). Taking logs turns this product into a sum, log(P(A)) + log(P(B)). It is thus easier to treat the neuron firing 'events' as a linear function.
Practical - Probability values lie in [0, 1], so multiplying many such small numbers can easily underflow in floating-point arithmetic (e.g. consider a long product of factors such as 0.0001 and 0.00001). Working with logs avoids the underflow.
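The underflow is easy to reproduce. A small sketch in Python, assuming 1000 independent events that each have probability 0.01 (numbers picked only for illustration): the true product is 1e-2000, which is far below the smallest positive double, while the sum of logs is an ordinary number.

```python
import math

probs = [0.01] * 1000  # hypothetical per-event probabilities

# Direct product: underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, well within range.
log_sum = sum(math.log(p) for p in probs)
```

Here product ends up as 0.0, so all information is lost, whereas log_sum is about -4605.17 and can still be compared or differentiated.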
For any given problem we need to optimise the likelihood of the parameters. But optimising the product requires all the data at once and a huge amount of computation.
We know that a sum is a lot easier to optimise, as the derivative of a sum is the sum of the derivatives. So taking the log converts the product into a sum and makes the computation faster.

R: Is integration of a sum faster than double sum of easy calculation?

I have a function which involves a sum.
I need to calculate the integral of this function squared.
I can easily use integrate() here and it's reasonably fast.
But I actually don't need to use integrate(), since
1. the square of a sum is a double sum,
2. the integral of a double sum is the double sum of the integrals of the inner-most terms, and
3. I can calculate the integral of the inner-most terms analytically.
So I can also calculate my integral by following the above steps, so that my final calculation is a double sum over the integral calculated analytically (i.e. I know what the actual expression for the integral is).
Would this be faster or slower than the above integrate() method, generally speaking (i.e. what's your hunch, and why)? I know a double sum is slow, but I can always use Rcpp.
I can't test this for myself since I don't yet know what the formula in step 3 above is. I don't want to spend time calculating it unless I know it is worth it...
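To make the three steps concrete, here is a toy sketch in Python (not R) with a stand-in function, since the actual function is not given. It assumes f(x) = sum_k a_k x^k on [0, 1], for which the inner integral of x^j * x^k is 1/(j + k + 1); the coefficients, the interval, and the Simpson's-rule routine standing in for integrate() are all assumptions for illustration.

```python
def f(x, a):
    """Toy function: a sum of terms a_k * x^k."""
    return sum(a_k * x**k for k, a_k in enumerate(a))

def integral_numeric(a, n=1000):
    """Composite Simpson's rule for the integral of f(x)^2 over [0, 1]."""
    h = 1.0 / n
    total = f(0.0, a)**2 + f(1.0, a)**2
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h, a)**2
    return total * h / 3.0

def integral_double_sum(a):
    """Same integral as a double sum of analytically integrated terms."""
    return sum(a_j * a_k / (j + k + 1)
               for j, a_j in enumerate(a)
               for k, a_k in enumerate(a))
```

With K terms, the double sum costs on the order of K^2 cheap evaluations, while quadrature costs K evaluations per node times the number of nodes, so which one wins depends on K and on how many nodes the integrator needs. Timing both on a toy case like this is one way to get a hunch before deriving the real formula.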

In what situation would a taylor series for a polynomial be necessary?

I'm having a hard time understanding why it would be useful to use the Taylor series for a function in order to gain an approximation of a function, instead of just using the function itself when programming. If I can tell my computer to compute e^(.1) and it will give me an exact value, why would I take an approximation instead?
Taylor series are generally not used to approximate functions. Usually, some form of minimax polynomial is used.
Taylor series converge slowly (it takes many terms to get the accuracy desired) and are inefficient (they are more accurate near the point around which they are centered and less accurate away from it). The largest use of Taylor series is likely in mathematics classes and papers, where they are useful for examining the properties of functions and for learning about calculus.
To approximate functions, minimax polynomials are often used. A minimax polynomial has the minimum possible maximum error for a particular situation (interval over which a function is to be approximated, degree available for the polynomial). There is usually no analytical solution to finding a minimax polynomial. They are found numerically, using the Remez algorithm. Minimax polynomials can be tailored to suit particular needs, such as minimizing relative error or absolute error, approximating a function over a particular interval, and so on. Minimax polynomials need fewer terms than Taylor series to get acceptable results, and they “spread” the error over the interval instead of being better in the center and worse at the ends.
When you call the exp function to compute e^x, you are likely using a minimax polynomial, because somebody has done the work for you and constructed a library routine that evaluates the polynomial. For the most part, the only arithmetic computer processors can do is addition, subtraction, multiplication, and division. So other functions have to be constructed from those operations. The first three give you polynomials, and polynomials are sufficient to approximate many functions, such as sine, cosine, logarithm, and exponentiation (with some additional operations of moving things into and out of the exponent field of floating-point values). Division adds rational functions, which is useful for functions like arctangent.
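The uneven accuracy of a Taylor polynomial is easy to see numerically. A small sketch in Python, assuming a degree-4 Taylor polynomial of e^x centered at 0 (the degree and the sample points are arbitrary choices for illustration):

```python
import math

def taylor_exp(x, degree=4):
    """Taylor polynomial of e^x around 0, truncated at the given degree."""
    return sum(x**k / math.factorial(k) for k in range(degree + 1))

def taylor_error(x, degree=4):
    """Absolute error of the truncated series against math.exp."""
    return abs(math.exp(x) - taylor_exp(x, degree))
```

The error is zero at the center, roughly 1e-7 at x = 0.1, and roughly 1e-2 at x = 1, which is the "better in the center, worse at the ends" behaviour that a minimax polynomial is designed to even out.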
For two reasons. First and foremost, most processors do not have hardware implementations of complex operations like exponentials, logarithms, etc. In such cases the programming language may provide a library function for computing those - in other words, someone used a Taylor series or other approximation for you.
Second, you may need a function that not even the language supports.
I recently wanted to use lookup tables with interpolation to get an angle and then compute the sin() and cos() of that angle. Trouble is that it's a DSP with no floating point and no trigonometric functions, so those two functions are really slow (software implementation). Instead I put sin(x) in the table instead of x and then used the Taylor series for y=sqrt(1-x*x) to compute the cos(x) from that. This Taylor series is accurate over the range I needed with only 5 terms (denominators are all powers of two!) and can be implemented in fixed point using plain C and generates code that is faster than any other approach I could think of.
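For illustration, the series in question looks like this. The sketch below is in floating-point Python rather than fixed-point C, and it assumes the five terms are the first five of the binomial expansion sqrt(1 - t) = 1 - t/2 - t^2/8 - t^3/16 - 5t^4/128 with t = x^2; the exact terms and range used in the original DSP code are not stated, so treat this as a guess at the idea.

```python
import math

def cos_from_sin(s):
    """Approximate cos from a known sin value s via sqrt(1 - s^2).

    Uses the first five terms of the binomial series; every denominator
    is a power of two, so in fixed point the divisions become shifts.
    """
    t = s * s
    return 1.0 - t / 2 - t * t / 8 - t**3 / 16 - 5 * t**4 / 128
```

For s = 0.5 this gives about 0.86606 against the exact 0.86603. The approximation degrades as |s| approaches 1, so it only suits a restricted range.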

Quadratic programming solver that guarantees a boundary point?

I have a problem that I have expressed as the minimization of a convex quadratic program with linear constraints. The problem is that I want to disallow any point that is strictly interior (i.e. I only find the answer useful if it is on a vertex of the feasible region).
I'd like to do this without modifying the objective function. I have already considered several modifications that would make this a non-issue, but they all have the unfortunate result of making the program non-convex.
By my estimation my only option for an efficient solution would be a solver that uses a penalty method to approach a solution from the outside of the feasible region. Does anyone know a decent solver for this?
My current objective function is a sum of parabolic cylinders.
Can you just find the vertices of the feasible region and then take the one which minimizes the objective function? This should just involve a bit of linear algebra and then a limited number of evaluations of the objective function.
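A rough sketch of that brute-force idea in Python, assuming a 2-D problem with constraints written as A x <= b and a bounded feasible region. The function names and the example objective are made up for illustration. Each pair of constraint lines is intersected, infeasible intersections are discarded, and the objective is evaluated at the remaining vertices.

```python
from itertools import combinations

def vertices_2d(A, b, tol=1e-9):
    """Vertices of {x : A x <= b} in 2-D, found by intersecting constraint pairs."""
    verts = []
    for (a1, b1), (a2, b2) in combinations(list(zip(A, b)), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < tol:
            continue  # parallel constraints never meet in a vertex
        # Cramer's rule for the 2x2 system a1.x = b1, a2.x = b2.
        x = (b1 * a2[1] - b2 * a1[1]) / det
        y = (a1[0] * b2 - a2[0] * b1) / det
        # Keep the intersection only if it satisfies every constraint.
        if all(r[0] * x + r[1] * y <= c + tol for r, c in zip(A, b)):
            verts.append((x, y))
    return verts

def best_vertex(A, b, objective):
    """Vertex of the feasible region with the smallest objective value."""
    return min(vertices_2d(A, b), key=lambda v: objective(*v))
```

For the unit square and the objective (x - 0.3)^2 + (y - 0.8)^2, whose unconstrained minimum is interior, this picks the corner (0, 1). Note that the number of vertices can grow very quickly with the number of constraints and dimensions, so this is only practical for small problems.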
