From expensive search to Integer Programming or Constraint Programming?

Consider m by n matrices M, all of whose entries are 0 or 1. For a given M, the question is whether there exists a non zero vector v, all of whose entries are -1, 0 or 1 for which Mv = 0. For example,
      [0 1 1 1]
M_1 = [1 0 1 1]
      [1 1 0 1]
In this example, there is no such vector v.
      [1 0 0 0]
M_2 = [0 1 0 0]
      [0 0 1 0]
In this example, the vector (0,0,0,1) gives M_2v = 0.
Given an m and n, I would like to find if there exists such an M so that there is no non-zero v such that Mv = 0.
If m = 3 and n = 4 then the answer is yes as we can see above.
I am currently solving this problem by trying all different M and v which is very expensive.
However, is it possible to express the problem as an integer programming or constraint programming problem, so that I can instead use an existing software package, such as SCIP, which might be more efficient?
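For concreteness, here is a minimal sketch of the brute-force search I am currently doing (pure Python; the helper names are just for illustration):

from itertools import product

def has_nonzero_kernel_vector(M, n):
    # Is there a non-zero v with entries in {-1, 0, 1} such that Mv = 0?
    for v in product((-1, 0, 1), repeat=n):
        if any(v) and all(sum(m_ij * v_j for m_ij, v_j in zip(row, v)) == 0 for row in M):
            return True
    return False

def good_pair(m, n):
    # Does some m x n matrix M with 0/1 entries admit no such v?
    for entries in product((0, 1), repeat=m * n):
        M = [entries[i*n:(i+1)*n] for i in range(m)]
        if not has_nonzero_kernel_vector(M, n):
            return True
    return False

print(good_pair(3, 4))   # True, witnessed by matrices like M_1 above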

This question is probably more mathematical than programming-related. I haven't found the final answer yet, but at least some ideas are here:
We can re-state the problem in the following way.
Problem A: Fix positive integers m and n. Let S be the set of n-dimensional vectors whose entries are 0 or 1. Does there exist an m by n matrix M, whose entries are 0 or 1, such that for any two different vectors v_1 and v_2 in S, the vectors Mv_1 and Mv_2 are different? (Or, you may say that the matrix M, considered as a map from n-dimensional vectors to m-dimensional vectors, is injective on the set S.)
In brief: given the pair (m, n), does there exist such an injective M?
Problem A is equivalent to the original problem. Indeed, if Mv_1 = Mv_2 for two different v_1 and v_2 in S, then we have M(v_1 - v_2) = 0, and the vector v_1 - v_2 will have only 0, 1, -1 as entries. The converse is obviously also true.
Another reinterpretation is:
Problem B: Let m and n be positive integers and let S be the set of n-dimensional vectors whose entries are 0 or 1. Can we find m vectors r_1, ..., r_m in S such that, for any pair of different vectors v_1 and v_2 in S, there exists an r_i satisfying <v_1, r_i> != <v_2, r_i>? Here <x, y> is the usual inner product.
In brief: can we choose m vectors in S that distinguish every element of S by taking inner products with the chosen ones?
Problem B is equivalent to Problem A, because you can identify the matrix M with m vectors in S.
In the following, I will use both descriptions of the problem freely.
Let's call the pair (m, n) a "good pair" if the answer to Problem A (or B) is yes.
With the description of Problem B, it is clear that, for a given n, there is a minimal m such that (m, n) is a good pair. Let us write m(n) for this minimal m associated to n.
Similarly, for a given m, there is a maximal n such that (m, n) is good. This is because, if (m, n) is good, i.e. there is an injective M as stated in Problem A, then for any n' <= n, erasing any n - n' columns of M will give an injective M', so the good n form an interval; that n cannot be arbitrarily large for fixed m follows from Lemma 3 below. Let us write n(m) for this maximal n associated to m.
So the task becomes to calculate the functions m(n) and/or n(m).
We first prove several lemmas:
Lemma 1: We have m(n + k) <= m(n) + m(k).
Proof: If M is an m(n) by n injective matrix for the pair (m(n), n) and K is an m(k) by k injective matrix for the pair (m(k), k), then the (m(n) + m(k)) by (n + k) matrix
[M 0]
[0 K]
works for the pair (m(n) + m(k), n + k). To see this, let v_1 and v_2 be any pair of different (n + k)-dimensional vectors. We may cut both of them into two pieces: the first n entries, and the last k entries. If the first pieces are not equal, then they can be distinguished by one of the first m(n) rows of the above matrix; if the first pieces are equal, then the second pieces must be different, hence they can be distinguished by one of the last m(k) rows of the above matrix.
Remark: The sequence m(n) is thus a subadditive sequence.
A simple corollary:
Corollary 2: We have m(n + 1) <= m(n) + 1, hence m(n) <= n.
Proof: Take k = 1 in Lemma 1.
Note that, from other known values of m(n) you can get better upper bounds. For example, since we know that m(4) <= 3, we have m(4n) <= 3n. Anyway, these always give you O(n) upper bounds.
The next lemma gives you a lower bound.
Lemma 3: m(n) >= n / log2(n + 1).
Proof: Let T be the set of m(n)-dimensional vectors whose entries lie in {0, 1, ..., n}. Any m(n) by n matrix M gives a map from S to T, sending v to Mv.
Since there exists an M such that the above map is injective, the size of the set T must be at least the size of the set S. The size of T is (n + 1)^m(n), and the size of S is 2^n, thus we have:
(n + 1)^m(n) >= 2^n
or equivalently, m(n) >= n / log2(n + 1).
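Just to get a feel for how far apart Corollary 2 and Lemma 3 leave things, a throwaway Python snippet:

from math import ceil, log2

for n in range(1, 16):
    lower = ceil(n / log2(n + 1))   # Lemma 3 (m(n) is an integer)
    upper = n                       # Corollary 2
    print(f"n = {n:2d}: {lower} <= m(n) <= {upper}")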
Back to programming
I have to say that I haven't figured out a good algorithm.
You might restate the problem as a Set Cover Problem, as follows:
Let U be the set of non-zero n-dimensional vectors with entries 1, 0 or -1, and let S be as above. Every vector w in S gives a subset C_w of U: C_w = {v in U : <w, v> != 0}. The question is then: can we find m vectors w such that the union of the subsets C_w is equal to U?
The general Set Cover Problem is NP-complete, but the Wikipedia article on Set Cover gives an integer linear programming formulation.
Anyway, this cannot take you much further than n = 10, I guess.
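A rough Python sketch of that restatement (it builds U and the subsets C_w explicitly and runs the standard greedy set-cover heuristic, so it only yields an upper bound on m(n), and the enumeration blows up quickly with n):

from itertools import product

def greedy_cover_size(n):
    # U: all non-zero vectors with entries in {-1, 0, 1}
    U = {v for v in product((-1, 0, 1), repeat=n) if any(v)}
    # S: candidate rows w, i.e. 0/1 vectors
    S = list(product((0, 1), repeat=n))
    # C_w = {v in U : <w, v> != 0}
    C = {w: {v for v in U if sum(a*b for a, b in zip(w, v)) != 0} for w in S}
    uncovered, chosen = set(U), []
    while uncovered:
        w = max(S, key=lambda row: len(C[row] & uncovered))   # greedy choice
        chosen.append(w)
        uncovered -= C[w]
    return len(chosen)

print(greedy_cover_size(4))   # an upper bound on m(4)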
I'll keep editing this answer if I have further results.

I think using Boolean matrix multiplication (BMM) will allow you to handle the Mv = 0 problem with only 1's and 0's more efficiently. Using this method you should be able to solve it without worrying about rank deficiencies due to the right-hand side equaling zero. Here is a link to documentation on some algorithms for BMM:
http://theory.stanford.edu/~virgi/cs367/lecture2.pdf

If I understand the question, you are asking, for a given m and n, whether there exists a matrix M (a linear transformation) with a trivial kernel, that is, Ker(M) = {0}.
Recall that this is the same as the null space of M being zero: Null(M) = {0}.
For the system Mv = 0, the null space is {0} if the rank of the matrix M is equal to the dimension of v. So your question comes down to asking about the existence of an m x n matrix with rank dim(v) = n.
The problem in this form has been discussed here
Generate "random" matrix of certain rank over a fixed set of elements
You can also frame this question in terms of determinants, because if M has determinant 0 then the null space is nontrivial. So you can think about this question in terms of constructing a matrix with entries in {-1, 0, 1} with a desired determinant.
I hope this helps.


Minimum number of increments so that all elements have a common divisor

I got this problem in an interview recently:
Given a set of numbers X = [X_1, X_2, ..., X_n] where X_i <= 500 for 1 <= i <= n, increment the numbers (only positive increments allowed) so that all elements of the set share a common divisor >= 2, and such that the sum of all increments is minimized.
For example, if X = [5, 7, 7, 7, 7], the new set would be X = [7, 7, 7, 7, 7], since you can just add 2 to X_1. X = [6, 8, 8, 8, 8] has a common divisor of 2 but is not optimal, since it requires a total increment of 5 (add 1 to the 5 and 1 to each of the four 7's).
I had a seemingly working solution (as in it passed all the test cases) that loops through the prime numbers < 500 and, for each X_i in X, finds the closest multiple of that prime greater than or equal to X_i.
function closest_multiple(x, y):
    return ceil(x/y) * y

min_increment = inf
for each prime_number < 500:
    total_increment = 0
    for each element X_i in X:
        total_increment += closest_multiple(X_i, prime_number) - X_i
    min_increment = min(min_increment, total_increment)
return min_increment
It's technically O(n), but is there a better way to solve this? It was suggested that I use dynamic programming, but I'm unsure how that would fit in here.
Constant-bounded entries case
When X_i is bounded by a constant, the best time you can achieve asymptotically is O(n), since it takes at least that long to read all of your inputs. There are some practical improvements:
Filter out duplicates, so you work with a list of (element, frequency) pairs.
Early stopping in your loop.
Faster computation of closest_multiple(x, p) - x. This is slightly hardware/language dependent, but a single integer modulus op is almost certainly faster than an int -> float cast, float division, ceiling() call, and multiplication on the same magnitude numbers.
freq_counts <- Initialize-Counter(X)       // List of (element, freq) pairs
min_increment = inf
for each prime_number < 500:
    total_increment = 0
    for each pair (X_i, freq) in freq_counts:
        // (-X_i) mod prime_number is 0 when X_i is already divisible
        total_increment += ((-X_i) % prime_number) * freq
        if total_increment >= min_increment: break
    min_increment = min(min_increment, total_increment)
return min_increment
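A runnable Python version of the above, assuming the entries are bounded by 500 as in the original statement (the sieve helper and the use of Counter are my own choices):

from collections import Counter
from math import inf

def primes_below(limit):
    # Simple sieve of Eratosthenes.
    flags = [True] * limit
    flags[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p*p::p] = [False] * len(flags[p*p::p])
    return [p for p, is_prime in enumerate(flags) if is_prime]

def min_total_increment(X, bound=500):
    freq_counts = Counter(X)                 # (element, frequency) pairs
    min_increment = inf
    for p in primes_below(bound + 1):
        total = 0
        for x, freq in freq_counts.items():
            total += ((-x) % p) * freq       # 0 if x is already a multiple of p
            if total >= min_increment:
                break                        # early stopping
        min_increment = min(min_increment, total)
    return min_increment

print(min_total_increment([5, 7, 7, 7, 7]))  # 2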
Large entries case
With uniformly chosen random data, the answer is almost always from using '2' as the divisor, and much larger prime divisors are vanishingly unlikely. However, let's solve for that worst case scenario.
Here, let max(X) = M, so that our input size is O(n (log M)) bits. We want a solution that's sub-exponential in that input size, so finding all primes below M (or even sqrt(M)) is out of the question. We're looking for any prime that gives us a min-total-increment; we'll call such a prime a min-prime. After finding such a prime, we can get the min-total-increment in linear time. We'll use a factoring approach along with two observations.
Observation 1: The answer is always at most n, since the increment needed for the prime 2 to divide X_i is at most 1.
Observation 2: We're trying to find primes that divide X_i or a number slightly larger than X_i for a large fraction of our entries X_i. Let Consecutive-Product-Divisors[i] be the set of all primes dividing either X_i or X_i + 1, which I'll abbreviate CPD[i]. This is exactly the set of all primes which divide X_i * (1 + X_i).
(Obs. 2 Continued) If U is a known upper bound on our answer (here, at most n), and p is a min-prime for X, then p must divide either X_i or X_i + 1 for at least n - U/2 of the indices i. Use frequency counts on the CPD array to find all such primes.
Once you have a list of candidate primes (all min-primes are guaranteed to be in this list), you can test each one individually using your algorithm. Since a number k can have at most O(log k) distinct prime divisors, there are O(n log M) possible distinct primes that divide at least half of the numbers
[X_1*(1 + X_1), X_2*(1 + X_2), ..., X_n*(1 + X_n)] that make up our candidate list. It's possible you can lower this bound with some more careful analysis, but it likely won't strongly affect the asymptotic runtime of the whole algorithm.
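A sketch of that candidate-generation step, with naive trial division standing in for a real factoring algorithm (the function names are mine; each candidate can then be tested with the linear-time scan from before):

from collections import Counter

def prime_factors(x):
    # Naive trial division -- a stand-in for a proper factoring algorithm.
    factors, d = set(), 2
    while d * d <= x:
        while x % d == 0:
            factors.add(d)
            x //= d
        d += 1
    if x > 1:
        factors.add(x)
    return factors

def candidate_primes(X):
    n = len(X)
    U = n                        # Observation 1: the answer is at most n
    counts = Counter()
    for x in X:
        # CPD[i]: primes dividing X_i or X_i + 1, i.e. primes dividing X_i * (X_i + 1)
        counts.update(prime_factors(x * (x + 1)))
    # A min-prime must divide X_i or X_i + 1 for at least n - U/2 indices i.
    return [p for p, c in counts.items() if c >= n - U / 2]

print(candidate_primes([5, 7, 7, 7, 7]))   # [2, 7] -- and 7 is the min-prime here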
A more optimal complexity for large entries
The complexity of this solution is hard to write in short form, because the bottleneck is factoring n numbers of maximum size M, plus O(n^2 log M) arithmetic (i.e. addition, subtraction, multiply, modulo) operations on numbers of maximum size M. That doesn't mean the runtime is unknown: If you select any integer factoring algorithm and large-integer-arithmetic algorithms, you can derive the runtime exactly. Unfortunately, because of factoring, the best known runtime of the above algorithm is super-polynomial (but sub-exponential).
How can we do better? I did find a more complicated solution, based on Greatest Common Divisors (GCD) and a dynamic-programming-like table, that runs in polynomial time (although it is likely much slower on non-astronomical-size inputs), since it doesn't rely on factoring.
The solution relies on the fact that at least one of the following two statements is true:
The number 2 is a min-prime for X, or
For at least one value of i, 1 <= i <= n there is an optimal solution where X_i remains unincremented, i.e. where one of the divisors of X_i produces a min-total-increment.
GCD-Based polynomial time algorithm
We can test 2 and all small primes quickly for their minimum costs. In fact, we'll test all primes p, p <= n, which we can do in polynomial time, and factor out these primes from X_i and its first n increments. This leads us to the following algorithm:
// Given: input list X = [X_1, X_2, ... X_n].
// Subroutine compute-min-cost(list A, int p) is
// just the inner loop of the above algorithm.

min_increment = inf;
for each prime p <= n:
    min_increment = min(min_increment, compute-min-cost(X, p));

// Initialize an empty, 2-D, n x (n+1) list Y[n][n+1] of offset X-values
for all 1 <= i <= n:
    for all 0 <= j <= n:
        Y[i][j] <- X[i] + j;

for each prime p <= n:           // Factor out all small prime divisors from Y
    for each Y[i][j]:
        while Y[i][j] % p == 0:
            Y[i][j] /= p;

for all 1 <= i <= n:             // Loop 1
    // Y[i][0] is the test 'unincremented' entry.
    // Initialize empty hash-tables 'costs' and 'new_costs'.
    // Keys of the hash-tables are GCDs,
    // values are a running sum of increment-costs for that GCD.
    costs[Y[i][0]] = 0;
    for all 1 <= k <= n:         // Loop 2
        if i == k: continue;
        clear all entries from new_costs   // or reinitialize to empty
        for all 0 <= j < n:      // Loop 3
            for each Key in costs:         // Loop 4
                g = GCD(Key, Y[k][j]);
                if g == 1: continue;
                if g is not a key in new_costs:
                    new_costs[g] = j + costs[Key];
                else:
                    new_costs[g] = min(new_costs[g], j + costs[Key]);
        swap(costs, new_costs);
    if costs is not empty:
        min_increment = min(min_increment, smallest Value in costs);

return min_increment;
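For concreteness, a rough (not benchmarked) Python transcription of this pseudocode; small_primes, compute_min_cost and the variable names are my own:

from math import gcd

def small_primes(limit):
    # All primes <= limit by trial division (fine for small limit = n).
    return [p for p in range(2, limit + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def compute_min_cost(X, p):
    # Total increment needed so that p divides every element.
    return sum((-x) % p for x in X)

def min_total_increment(X):
    n = len(X)
    primes = small_primes(max(2, n))
    best = min(compute_min_cost(X, p) for p in primes)

    # Y[i][j] = X[i] + j, with all primes <= n factored out.
    Y = [[X[i] + j for j in range(n + 1)] for i in range(n)]
    for p in primes:
        for row in Y:
            for j in range(n + 1):
                while row[j] % p == 0:
                    row[j] //= p

    for i in range(n):                       # Loop 1: X[i] stays unincremented
        costs = {Y[i][0]: 0}                 # key: GCD so far, value: min cost
        for k in range(n):                   # Loop 2
            if k == i:
                continue
            new_costs = {}
            for j in range(n):               # Loop 3: increment X[k] by j
                for key, cost in costs.items():    # Loop 4
                    g = gcd(key, Y[k][j])
                    if g == 1:
                        continue
                    if g not in new_costs or cost + j < new_costs[g]:
                        new_costs[g] = cost + j
            costs = new_costs
        if costs:
            best = min(best, min(costs.values()))
    return best

print(min_total_increment([5, 7, 7, 7, 7]))   # 2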
The correctness of this solution follows from the previous two observations, and the (unproven, but straightforward) fact that there is a list
[X_1 + r_1, X_2 + r_2, ... , X_n + r_n] (with 0 <= r_i <= n for all i) whose GCD is a divisor with minimum increment cost.
The runtime of this solution is trickier: GCDs can easily be computed in O(log^2(M)) time, and the list of all primes up to n can be computed in low poly(n) time. From the loop structure of the algorithm, to prove a polynomial bound on the whole algorithm, it suffices to show that the maximum size of our 'costs' hash-table is polynomial in log M. This is where the 'factoring-out' of small primes comes into play. After iteration k of loop 2, the entries in costs are (Key, Value) pairs, where each Key is the GCD of k + 1 elements:
our initial Y[i][0], and [Y[1][j_1], Y[2][j_2], ... Y[k][j_k]] for some 0 <= j_l < n. The Value for this Key is the minimum increment sum needed for this divisor (i.e. sum of the j_l) over all possible choices of j_l.
There are at most O(log M) unique prime divisors of Y[i][0]. Each such prime divides at most one key in our 'costs' table at any time: Since we've factored out all prime divisors below n, any remaining prime divisor p can divide at most one of the n consecutive numbers in any Y[j] = [X_j, 1 + X_j, ... n-1 + X_j]. This means the overall algorithm is polynomial, and has a runtime below O(n^4 log^3(M)).
From here, the open questions are whether a simpler algorithm exists, and how much better a bound you can achieve. You can definitely optimize this algorithm (including using the early stopping and frequency counts from before). It's also likely that better bounds on counting large, distinct prime divisors of consecutive numbers would show this solution is already faster than the stated runtime, but a simplification of this solution would be very interesting.

Construct a bijective function to map arbitrary integer from [1, n] to [1, n] randomly

I want to construct a bijective function f(k, n, seed) from [1, n] to [1, n], where 1 <= k <= n and 1 <= f(k, n, seed) <= n for each given seed and n. The function should in effect return a value from a random permutation of 1, 2, ..., n. The randomness is decided by the seed; different seeds may correspond to different permutations. I want the function f(k, n, seed)'s time complexity to be O(1) for each 1 <= k <= n and any given seed.
Does anyone know how I can construct such a function? The randomness is allowed to be pseudo-randomness. n can be very large (e.g. >= 1e8).
No matter how you do it, you will always have to store a list of the numbers still available or the numbers already used ... A simple possibility would be the following:
const avail = [1, 2, 3, ..., n];
let random = new Random(seed);

function f(k, n) {
    let index = random.next(n - k);     // 0 <= index < n - k
    let result = avail[index];
    avail[index] = avail[n - k - 1];    // move the last available element into the gap
    return result;
}
The assumptions for this are the following:
the array avail is 0-indexed
random.next(x) creates a random integer i with 0 <= i < x
the first k to call the function f with is 0
f is called for contiguous k = 0, 1, 2, 3, ..., n - 1
The principle works as follows:
avail holds all numbers still available for the permutation. When you take a random index, the element at that index is the next element of the permutation. Then, instead of slicing that element out of the array, which is quite expensive, you just replace the currently selected element with the last available element in the avail array. In the next iteration you (virtually) decrease the size of the avail array by 1 by decreasing the upper limit for the random index by one.
I'm not sure how uniform this random permutation is in terms of the distribution of the values, i.e. it may happen, for instance, that a certain range of numbers is more likely to appear at the beginning of the permutation or at the end of the permutation.
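The same idea in Python, mainly as a sanity check that the scheme emits each of 1, ..., n exactly once (random.Random and the generator form are my own choices):

import random

def lazy_permutation(n, seed):
    # Yields a pseudo-random permutation of 1..n one element at a time,
    # swapping each chosen element out for the last still-available one.
    avail = list(range(1, n + 1))
    rng = random.Random(seed)
    for k in range(n):
        index = rng.randrange(n - k)        # 0 <= index < n - k
        result = avail[index]
        avail[index] = avail[n - k - 1]     # overwrite with the last available element
        yield result

print(sorted(lazy_permutation(10, seed=42)) == list(range(1, 11)))   # True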
A simple, but not very 'random', approach would be to use the fact that, if a is relatively prime to n (i.e. they have no common factors), then
x-> (a*x + b)%n
is a permutation of {0, ..., n-1}. To find the inverse of this map, you can use the extended Euclidean algorithm to find k and l so that
1 = gcd(a, n) = k*a + l*n,
for then the inverse of the map above is
y -> (k*y + c) mod n,
where c = -k*b mod n.
So you could choose a to be a 'random' number in {0, ..., n-1} that is relatively prime to n, and b to be any number in {0, ..., n-1}.
Note that you'll need to do this in 64 bit arithmetic to avoid overflow in computing a*x.
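A small Python sketch of this construction (ext_gcd and the retry loop for picking a are mine; Python integers do not overflow, so the 64-bit caveat matters only in fixed-width languages):

from math import gcd
import random

def ext_gcd(a, b):
    # Returns (g, s, t) with s*a + t*b == g == gcd(a, b).
    if b == 0:
        return a, 1, 0
    g, s, t = ext_gcd(b, a % b)
    return g, t, s - (a // b) * t

def make_affine_bijection(n, seed):
    rng = random.Random(seed)
    a = rng.randrange(1, n)
    while gcd(a, n) != 1:               # a must be relatively prime to n
        a = rng.randrange(1, n)
    b = rng.randrange(n)
    k = ext_gcd(a, n)[1] % n            # k = a^(-1) mod n
    f = lambda x: (a * x + b) % n       # permutation of {0, ..., n-1}
    f_inv = lambda y: (k * (y - b)) % n
    return f, f_inv                     # shift by 1 if you need the range {1, ..., n}

f, f_inv = make_affine_bijection(10**8, seed=1)
print(f(12345), f_inv(f(12345)))        # the second value is 12345 again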

Run time of the following code

So my code is
function mystery(n, k):
    if k ≥ n
        return foo(n)
    sum = 0
    for i = k to n
        sum = sum + mystery(n, k+1)
    return sum
I have drawn out the recursion tree, and the answer I am getting is $n^2*(n-1)!$, where foo is O(n). Is that correct?
You are correct. If you were to turn the recursion into iteration, you would have loops that run triangle(n-k) times -- where triangle is the triangle function, the sum of integers 1 through N. The formula for that is
triangle(N) = N * (N+1) / 2
Multiply this by O(n), drop the 1/2 constant, and that yields your answer.
[N.B. Since k is a constant for complexity purposes, you drop it as well.]

How to find the sum of products of all combinations of the first n natural numbers taken r at a time? n and r are both constants [duplicate]

This question already has answers here:
Sum of multiplication of all combination of m element from an array of n elements
Suppose n = 10 and r = 3; then I want to find the sum of the products of the numbers from 1 to 10 taken 3 at a time, over all combinations. There will be nCr possible products and I want their sum.
What you want is the degree 7 coefficient of
(x+1)(x+2)·…·(x+10)
You can compute all coefficients of that polynomial in O(n²) operations by adding one factor after the other.
For the computation of coefficients from roots see How to efficiently find coefficients of a polynomial from its roots?, Efficient calculation of polynomial coefficients from its roots, Find the coefficients of the polynomial given its roots
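For illustration, a short Python version of that O(n²) scheme (the function name is mine; it multiplies in one linear factor at a time and reads off the coefficient of x^(n-r)):

def sum_of_products(n, r):
    # coeffs[d] is the coefficient of x^d in the product built so far.
    coeffs = [1]                        # the constant polynomial 1
    for root in range(1, n + 1):
        new = [0] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            new[d] += c * root          # the "+ root" part of the factor
            new[d + 1] += c             # the "x" part of the factor
        coeffs = new
    return coeffs[n - r]

print(sum_of_products(10, 3))   # 18150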
A recurrence for S(n, m) where S is the sum you want is: S(n + 1, m) = S(n, m) + (n + 1)*S(n, m - 1), with S(n, 1) = n*(n + 1)/2 and S(m, m) = m!.
The term (n + 1)*S(n, m - 1) comes from the additional terms in the sum which are created by increasing n by 1: these terms are all the ones you get for S(n, m - 1) (i.e. take one fewer term in the product) and append n + 1 to each one.
I don't know how to solve recurrences like that in general. In this case, I would build a table of terms S(n, m) which would look like a triangle. The bottom row would be for S(n, 1) which is n*(n + 1)/2. The diagonal would be S(m, m) which is m!. So you could try to guess a formula which gives general entries in the table and then add them all up.
A Python program to compute S(n, m) for specific values of n and m might look like:
import math

def S(n, m):
    if m == 1:
        return n*(n + 1)//2     # integer division keeps the result exact
    elif n == m:
        return math.factorial(m)
    else:
        return S(n - 1, m) + n*S(n - 1, m - 1)
which involves, I believe, approximately m*n unique calls to S (though without memoization many of those calls are repeated). Of course it's easy to write this in some other language.
For example, S(10, 3) = 18150.
EDIT: revised the estimated number of operations.
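Since the same (n, m) pairs recur many times, memoizing the recursion keeps the work down to the roughly m*n unique subproblems; a small variant using functools.lru_cache (with integer division so the result stays an integer):

import math
from functools import lru_cache

@lru_cache(maxsize=None)
def S(n, m):
    if m == 1:
        return n * (n + 1) // 2
    elif n == m:
        return math.factorial(m)
    else:
        return S(n - 1, m) + n * S(n - 1, m - 1)

print(S(10, 3))   # 18150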

Vectorizing code to calculate (squared) Mahalanobis Distance

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)
Of course, the fact that the outer loop happens in Python instead of in native code in the numpy library means it's not as fast as it could be. n and m are about 3-4 and several hundred thousand respectively, and I'm doing this somewhat often in an interactive program, so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
for i in range(m):
diff = x[i, :] - u[j, :]
d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X[i, k] - U[j, k].
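A quick sanity check of that expression against the naive double loop from the question, on small random inputs (the shapes and the way the test matrices T_j are generated are arbitrary choices of mine):

import numpy as np

m, k, n = 50, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))
U = rng.normal(size=(k, n))
A = rng.normal(size=(k, n, n))
T = A @ np.transpose(A, (0, 2, 1)) + n * np.eye(n)   # stack of symmetric matrices standing in for the S_j^-1

diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)

D_loop = np.empty((m, k))
for j in range(k):
    for i in range(m):
        d = X[i] - U[j]
        D_loop[i, j] = d @ T[j] @ d

print(np.allclose(D, D_loop))   # True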
Dougal nailed this one with an excellent and detailed answer, but I thought I'd share a small modification that I found increases efficiency, in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
    # Cholesky decomposition of the inverse of the covariance matrix
    # (doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))
    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)
    # Same as above, but note the different index at the end because the
    # matrix y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)
I ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma of size 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer but adding here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy 2 is Dougal's second solution. Since Fancy 2 needs to calculate the inverses of all the covariance matrices, I also tried a "cheating version" where it was passed these as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation, but that was so slow it would have taken all day.
What we can take away from this is that Dougal's first method is about 20% faster than however Scipy does it. Unfortunately, despite its cleverness, the second method is only about 60% as fast as the first. There are probably some other optimizations that can be done, but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. It's also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable; I did not actually check the results).
