Linear regression and matrix division in Julia - julia

The well known formula for OLS is (X'X)^(-1)X'y where X is nxK and y is nx1.
One way to implement this in Julia is (X'*X)\X'*y.
But I found that X\y gives the almost same output up to the tiny computational error.
Do they always compute the same thing (as long as n>k)? If so, which one should I use?

When X is square, there is a unique solution and LU-factorization (with pivoting) is a numerically-stable way to calculate this. That is the algorithm that backslash uses in this case.
When X is not square, which is the case in most regression problems, then there is no unique solution but there is a unique least square solution. The QR factorization method for solving Xβ = y is a numerically stable method for generating the least square solution, and in this case X\y uses the QR-factorization and thus gives the OLS solution.
Notice the words numerically stable. While (X'*X)\X'*y will theoretically always give the same result as backslash, in practice backslash (with the correct factorization choice) will be more precise. This is because the factorization algorithms are implemented to be numerically stable. Because of the change for floating point errors to accumulate when doing (X'*X)\X'*y, it's not recommended that you use this form for any real numerical work.
Instead, (X'*X)\X'*y is somewhat equivalent to an SVD factorization which is the most nuemrically stable algorithm, but also the most expensive (in fact, it's basically writing out the Moore-Penrose pseudoinverse which is how an SVD factorization is used to solve a linear system). To directly do an SVD factorization using a pivoted SVD, do svdfact(X) \ y on v0.6 or svd(X) \ y on v0.7. Doing this directly is more stable than (X'*X)\X'*y. Note that qrfact(X) \ y or qr(X) \ y (v0.7) is for QR. See the factorizations portion of the documentation for more details on all of the choices.

Following the documentation the result of X\y is (there notation \(A, B) is used not X and y):
For rectangular A the result is the minimum-norm least squares solution
This is your case I guess as you assume n>k (so your matrix is not square). So you can safely use X\y. Actually it is better to use it than the standard formula as you will get a result even if rank of X is less than min(n,k), whereas standard formula (X'*X)^(-1)*X'*y will fail or produce numerically unstable result if X'*X is nearly singular.
If X would be square (this is not your case) then we have a bit different rule in the documentation:
For input matrices A and B, the result X is such that A*X == B when A is square
This means that the \ algorithm would produce an error if your matrix were singular or produce numerically unstable results if the matrix were nearly singular (in practice most often lu function that is called internally for general dense matrices may throw SingularException).
If you want a catch-all solution (for square and non square matrices) then qr(X, Val(true)) \ y can be used.

Short answer: No, use the first one (the well-known one).
Long answer:
The linear regression model is Xβ = y, and it's easily to derive β = X \ y, which is your second method. However, in most time (when X is not invertible), this is wrong, since you cannot simply left multiply X^-1. The correct way is to solve β = argmin{‖y - Xβ‖^2} instead, which leads to the first method.
To show they are not always the same, simple construct a case where X is not invertible:
julia> X = rand(10, 10)
10×10 Array{Float64,2}:
0.938995 0.32773 0.740556 0.300323 0.98479 0.48808 0.748006 0.798089 0.864154 0.869864
0.973832 0.99791 0.271083 0.841392 0.743448 0.0951434 0.0144092 0.785267 0.690008 0.494994
0.356408 0.312696 0.543927 0.951817 0.720187 0.434455 0.684884 0.72397 0.855516 0.120853
0.849494 0.989129 0.165215 0.76009 0.0206378 0.259737 0.967129 0.733793 0.798215 0.252723
0.364955 0.466796 0.227699 0.662857 0.259522 0.288773 0.691278 0.421251 0.593215 0.542583
0.126439 0.574307 0.577152 0.664301 0.60941 0.742335 0.459951 0.516649 0.732796 0.990509
0.430213 0.763126 0.737171 0.433884 0.85549 0.163837 0.997908 0.586575 0.257428 0.33239
0.28398 0.162054 0.481452 0.903363 0.780502 0.994575 0.131594 0.191499 0.702596 0.0967979
0.42463 0.142 0.705176 0.0481886 0.728082 0.709598 0.630134 0.139151 0.423227 0.942262
0.197805 0.526095 0.562136 0.648896 0.805806 0.168869 0.200355 0.557305 0.69514 0.227137
julia> y = rand(10, 1)
10×1 Array{Float64,2}:
0.7751785556478308
0.24185992335144801
0.5681904264574333
0.9134364924569847
0.20167825754443536
0.5776727022413637
0.05289808385359085
0.5841180308242171
0.2862768657856478
0.45152080383822746
julia> ((X' * X) ^ -1) * X' * y
10×1 Array{Float64,2}:
-0.3768345891121706
0.5900885565174501
-0.6326640292669291
-1.3922334538787071
0.06182039005215956
1.0342060710792016
0.045791973670925995
0.7237081408801955
1.4256831037950832
-0.6750765481219443
julia> X \ y
10×1 Array{Float64,2}:
-0.37683458911228906
0.5900885565176254
-0.6326640292676649
-1.3922334538790346
0.061820390052523294
1.0342060710793235
0.0457919736711274
0.7237081408802206
1.4256831037952566
-0.6750765481220102
julia> X[2, :] = X[1, :]
10-element Array{Float64,1}:
0.9389947787349187
0.3277301697101178
0.7405555185711721
0.30032257202572477
0.9847899425069042
0.48807977638742295
0.7480061513093117
0.79808859136911
0.8641540973071822
0.8698636291189576
julia> ((X' * X) ^ -1) * X' * y
10×1 Array{Float64,2}:
0.7456524759867015
0.06233042922132548
2.5600126098899256
0.3182206475232786
-2.003080524452619
0.272673133766017
-0.8550165639656011
0.40827327221785403
0.2994419115664999
-0.37876151249955264
julia> X \ y
10×1 Array{Float64,2}:
3.852193379477664e15
-2.097948470376586e15
9.077766998701864e15
5.112094484728637e15
-5.798433818338726e15
-2.0446050874148052e15
-3.300267174800096e15
2.990882423309131e14
-4.214829360472345e15
1.60123572911982e15

Related

Is there a way to optimize the calculation of Bernoulli Log-Likelihoods for many multivariate samples?

I currently have two Torch Tensors, p and x, which both have the shape of (batch_size, input_size).
I would like to calculate the Bernoulli log likelihoods for the given data, and return a tensor of size (batch_size)
Here's an example of what I'd like to do:
I have the formula for log likelihoods of Bernoulli Random variables:
\sum_i^d x_{i} ln(p_i) + (1-x_i) ln (1-p_i)
Say I have p Tensor:
[[0.6 0.4 0], [0.33 0.34 0.33]]
And say I have the x tensor for the binary inputs based on those probabilities:
[[1 1 0], [0 1 1]]
And I want to calculate the log likelihood for every sample, which would result in:
[[ln(0.6)+ln(0.4)], [ln(0.67)+ln(0.34)+ln(0.33)]]
Would it be possible to do this computation without the use of for loops?
I know I could use torch.sum(axis=1) to do the final summation between the logs, but is it possible to do the Bernoulli log-likelihood computation without the use of for loops? or use at most 1 for loop? I am trying to vectorize this operation as much as possible. I could've sworn we could use LaTeX for equations before, did something change or is it another website?
Though not a good practice, you can directly use the formula on the tensors as follows (works because these are element wise operations):
import torch
p = torch.tensor([
[0.6, 0.4, 0],
[0.33, 0.34, 0.33]
])
x = torch.tensor([
[1., 1, 0],
[0, 1, 1]
])
eps = 1e-8
bll1 = (x * torch.log(p+eps) + (1-x) * torch.log(1-p+eps)).sum(axis=1)
print(bll1)
#tensor([-1.4271162748, -2.5879497528])
Note that to avoid log(0) error, I have introduced a very small constant eps inside it.
A better way to do this is to use BCELoss inside nn module in pytorch.
import torch.nn as nn
bce = nn.BCELoss(reduction='none')
bll2 = -bce(p, x).sum(axis=1)
print(bll2)
#tensor([-1.4271162748, -2.5879497528])
Since pytorch computes the BCE as a loss, it prepends your formula with a negative sign. The attribute reduction='none' says that I do not want the computed losses to be reduced (averaged/summed) across the batch in any way. This is advisable to use since we do not need to manually take care of numerical stability and error handling (such as adding eps above.)
You can indeed verify that the two solutions actually return the same tensor (upto a tolerance):
torch.allclose(bll1, bll2)
# True
or the tensors (without summing each row):
torch.allclose((x * torch.log(p+eps) + (1-x) * torch.log(1-p+eps)), -bce(p, x))
# True
Feel free to ask for further clarifications.

How to fit a function to measurements with error?

I am currently using Measurements.jl for error propagation and LsqFit.jl for fitting functions to data. Is there a simple way to fit a function to data with errors? It would be no problem to use an other package if that makes things easier.
Thanks in advance for your help.
While in principle it should be possible to make these packages work together, the implementation of LsqFit.jl does not seem to play nicely with the Measurement type. However, if one writes a simple least-squares linear regression directly
# Generate test data, with noise
x = 1:10
y = 2x .+ 3
using Measurements
x_observed = (x .+ randn.()) .± 1
y_observed = (y .+ randn.()) .± 1
# Simple least-squares linear regression
# for an equation of the form y = a + bx
# using `\` for matrix division
linreg(x, y) = hcat(fill!(similar(x), 1), x) \ y
(a, b) = linreg(x_observed, y_observed)
then
julia> (a, b) = linreg(x_observed, y_observed)
2-element Vector{Measurement{Float64}}:
3.9 ± 1.4
1.84 ± 0.23
This ought to be able to work with either x uncertainties, y uncertainties, or both.
If you need a nonlinear least-squares fit, it should also be possible to extend the above approach to nonlinear least squares -- though for the latter it may be easier to just find where the incompatibility is in LsqFit.jl and make a PR.

Vectorizing code to calculate (squared) Mahalanobis Distiance

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means by fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
diff = x[i,:] - mu
d[i] = diff.T.dot(s_inv).dot(diff)
Of course, given that the outer loop is happening in python instead of in native code in the numpy library means it's not as fast as it could be. $n$ and $m$ are about 3-4 and several hundred thousand respectively and I'm doing this somewhat often in an interactive program so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
for i in range(m):
diff = x[i, :] - u[j, :]
d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X_[i, k] - U[j, k].
Dougal nailed this one with an excellent and detailed answer, but thought I'd share a small modification that I found increases efficiency in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
L = np.linalg.cholesky(sigma)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
# Cholesky decomposition of inverse of covariance matrix
# (Doing this in either order should be equivalent)
linv = np.linalg.cholesky(np.linalg.inv(sigma))
# Just do regular matrix multiplication with this matrix
y = (X - mu[np.newaxis,:]).dot(linv)
# Same as above, but note different index at end because the matrix
# y is transposed here compared to above
return np.einsum('ij,ij->i', y, y)
Ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer but adding here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy2 is Dougal's 2nd solution. Since the Fancy 2 needs to calculate the inverses of all the covariance matrices I also tried a "cheating version" where it was passed these as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation but that was so slow it would have taken all day.
What we can take away from this is that using Dougal's first method is about 20% faster than however Scipy does it. Unfortunately despite its cleverness the 2nd method is only about 60% as fast as the first. There are probably some other optimizations that can be done but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. Also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable, I did not actually check the results).

Generic function for solving n-order polynomial roots in Julia

All,
I've just been starting to play around with the Julia language and am enjoying it quite a bit. At the end of the 3rd tutorial there's an interesting problem: genericize the quadratic formula such that it solves for the roots of any n-order polynomial equation.
This struck me as (a) an interesting programming problem and (b) an interesting Julia problem. Has anyone out there solved this one? For reference, here is the Julia code with a couple toy examples. Again, the idea is to make this generic for any n-order polynomial.
Cheers,
Aaron
function derivative(f)
return function(x)
# pick a small value for h
h = x == 0 ? sqrt(eps(Float64)) : sqrt(eps(Float64)) * x
# floating point arithmetic gymnastics
xph = x + h
dx = xph - x
# evaluate f at x + h
f1 = f(xph)
# evaluate f at x
f0 = f(x)
# divide the difference by h
return (f1 - f0) / dx
end
end
function quadratic(f)
f1 = derivative(f)
c = f(0.0)
b = f1(0.0)
a = f(1.0) - b - c
return (-b + sqrt(b^2 - 4a*c + 0im))/2a, (-b - sqrt(b^2 - 4a*c + 0im))/2a
end
quadratic((x) -> x^2 - x - 2)
quadratic((x) -> x^2 + 2)
The package PolynomialRoots.jl provides the function roots() to find all (real and complex) roots of polynomials of any order. The only mandatory argument is the array with coefficients of the polynomial in ascending order.
For example, in order to find the roots of
6x^5 + 5x^4 + 3x^2 + 2x + 1
after loading the package (using PolynomialRoots) you can use
julia> roots([1, 2, 3, 4, 5, 6])
5-element Array{Complex{Float64},1}:
0.294195-0.668367im
-0.670332+2.77556e-17im
0.294195+0.668367im
-0.375695-0.570175im
-0.375695+0.570175im
The package is a Julia implementation of the root-finding algorithm described in this paper: http://arxiv.org/abs/1203.1034
PolynomialRoots.jl has also support for arbitrary precision calculation. This is useful for solving equation that cannot be solved in double precision. For example
julia> r = roots([94906268.375, -189812534, 94906265.625]);
julia> (r[1], r[2])
(1.0000000144879793 - 0.0im,1.0000000144879788 + 0.0im)
gives the wrong result for the polynomial, instead passing the input array in arbitrary precision forces arbitrary precision calculations that provide the right answer (see https://en.wikipedia.org/wiki/Loss_of_significance):
julia> r = roots([BigFloat(94906268.375), BigFloat(-189812534), BigFloat(94906265.625)]);
julia> (Float64(r[1]), Float64(r[2]))
(1.0000000289759583,1.0)
There are no algebraic formulae for a general polynomials of degree five and above (infact there cant be see here). So theoretically, you could proceed using the same methodology for solutions to cubics and quartics, but even that would be a lot of hard work given very unwieldy formulae for roots of quartics. You could also use a CAS like SymPy to find out those formulae.

Computation of numerical integral involving convolution

I have to solve the following convolution related numerical integration problem in R or perhaps computer algebra system like Maxima.
Integral[({k(y)-l(y)}^2)dy]
where
k(.) is the pdf of a standard normal distribution
l(y)=integral[k(z)*k(z+y)dz] (standard convolution)
z and y are scalars
The domain of y is -inf to +inf.
The integral in function l(.) is an indefinite integral. Do I need to add any additional assumption on z to obtain this?
Thank you.
Here is a symbolic solution from Mathematica:
R does not do symbolic integration, just numerical integration. There is the Ryacas package which intefaces with Yacas, a symbolic math program that may help.
See the distr package for possible help with the convolution parts (it will do the convolutions, I just don't know if the result will be integrable symbolicly).
You can numerically integrate the convolutions from distr using the integrate function, but all the parameters need to be specified as numbers not variables.
For the record, here is the same problem solved with Maxima 5.26.0.
(%i2) k(u):=exp(-(1/2)*u^2)/sqrt(2*%pi) $
(%i3) integrate (k(x) * k(y + x), x, minf, inf);
(%o3) %e^-(y^2/4)/(2*sqrt(%pi))
(%i4) l(y) := ''%;
(%o4) l(y):=%e^-(y^2/4)/(2*sqrt(%pi))
(%i5) integrate ((k(y) - l(y))^2, y, minf, inf);
(%o5) ((sqrt(2)+2)*sqrt(3)-2^(5/2))/(4*sqrt(3)*sqrt(%pi))
(%i6) float (%);
(%o6) .02090706601281356
Sorry for the late reply. Leaving this here in case someone finds it by searching.
I try to do something similar in matlab, where I convolute two random (Rayleigh distributed) variables. The result of fz_fun is equal to fy_fun, I don't know why. Maybe some here knows it?
sigma1 = 0.45;
sigma2 = 0.29;
fx_fun =#(x) [0*x(x<0) , (x(x>=0)./sigma1^2).*exp(-0.5*(x(x>=0)./sigma1).^2)];
fy_fun =#(y) [0*y(y<0) , (y(y>=0)./sigma2^2).*exp(-0.5*(y(y>=0)./sigma2).^2)];
% Rayleigh distribution of random var X,Y:
step = 0.1;
x= -2:step:3;
y= -2:step:3;
%% Convolution:
z= y;
fz = zeros(size(y));
for i = 1:length(y)
fz_fun(i) = integral(#(z) fy_fun(y(i)).*fx_fun(z-y(i)),0,Inf); % probability density of random variable z= x+y
end

Resources