I'm running a Gaussian process simulation. I have x and y, and I want to split them into 85% training data and 15% testing data, then fit a model on the training set and predict on the test set. How should I write the code? I know that in Python the function I would use is train_test_split().
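using Distributions, LinearAlgebra
x = rand(100)
dis = [abs(i - j) for i in x, j in x]   # pairwise distance matrix
σ2 = 1
g = 1
l = Matrix(I, 100, 100)
μ = zeros(100)
# exp.() broadcasts elementwise; exp() on a square matrix is the matrix exponential
Σ = σ2 * exp.(-dis ./ g) + 0.1 * l
y = MvNormal(μ, Σ)
Y = rand(y, 100)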
Use the partition function from MLJ:
using MLJ
MLJ.partition((x, Y), 0.85, multi=true)
Here is its documentation https://alan-turing-institute.github.io/MLJ.jl/dev/preparing_data/#Splitting-data.
You can use TrainTestSplit from the Lathe package:
using Lathe.preprocess: TrainTestSplit
traindf, testdf = TrainTestSplit(df,.85);
Check this link for more: https://github.com/emmettgb/Lathe-Books/tree/main/preprocess
Or use partition from BetaML:
using BetaML
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.85,0.15])
This generalises to n arrays (e.g. train/val/test), and you can also choose the dimension along which to partition and whether or not to randomise the partition.
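Since the question mentions Python's train_test_split(), here is a minimal NumPy sketch of what such a partition does: shuffle the row indices once and cut every array at the same place. The helper name partition and its parameters are hypothetical and only illustrate the idea; this is not the MLJ or BetaML implementation.

```python
import numpy as np

def partition(arrays, train_fraction=0.85, shuffle=True, seed=None):
    """Split every array in `arrays` along axis 0 using the same row indices."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of rows")
    rng = np.random.default_rng(seed)
    # One shared permutation keeps x and y rows aligned after the split.
    idx = rng.permutation(n) if shuffle else np.arange(n)
    n_train = int(round(train_fraction * n))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return [(np.asarray(a)[train_idx], np.asarray(a)[test_idx]) for a in arrays]

# Toy data standing in for the x and y of the question.
x = np.random.default_rng(0).random(100)
y = np.sin(x)
(x_train, x_test), (y_train, y_test) = partition([x, y], 0.85, seed=1)
```

Passing shuffle=False gives a deterministic head/tail split, which can be preferable when the rows are ordered in time.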
I had an application that required something similar to the problem described here.
I too need to generate a set of positive integer random variables {Xi} that add up to a given sum S, where each variable might have constraints such as mi<=Xi<=Mi.
This I know how to do. The problem is that in my case I might also have constraints between the random variables themselves, say Xi<=Fi(Xj) for some given Fi (let's also say Fi's inverse is known). Now, how should one generate the random variables "correctly"? I put correctly in quotes because I'm not really sure what it would mean here, except that I want the generated numbers to cover all possible cases with as uniform a probability as possible for each case.
Say we even look at a very simple case:
4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, what would be the "correct" way to generate them?
P.S. I know that this seems like it would be a better fit for math overflow but I found no solutions there either.
For 4 random variables X1, X2, X3, X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, one could use a multinomial distribution.
As long as the probability of the first number is low enough, your condition is almost always satisfied; if it is not, reject and repeat.
And the multinomial distribution by design has the sum equal to 100.
Code, Windows 10 x64, Python 3.8
import numpy as np

def x1x2x3x4(rng):
    while True:
        v = rng.multinomial(100, [0.1, 1/2-0.1, 1/4, 1/4])
        if v[0] <= 2*v[1]:
            return v
    return None

rng = np.random.default_rng()
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
UPDATE
There is a lot of freedom in selecting the probabilities. E.g., you could make the others (#2, #3, #4) symmetric. Code
def x1x2x3x4(rng, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(100, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
    return None
UPDATE II
If you start rejecting combinations, you artificially bump the probabilities of one subset of events and lower the probabilities of another, while the total always sums to 1. There is NO WAY to have uniform probabilities with the conditions you want to meet. The code below samples from a multinomial with equal probabilities and computes histograms and mean values. Each mean is supposed to be exactly 25 (= 100/4), but as soon as you reject some samples, you lower the mean of the first value and raise the mean of the second. The difference is small, but UNAVOIDABLE. If that is OK with you, so be it. Code
import numpy as np
import matplotlib.pyplot as plt

def x1x2x3x4(rng, summa, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(summa, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
    return None

rng = np.random.default_rng()
s = 100
N = 5000000

# histograms
first = np.zeros(s+1)
secnd = np.zeros(s+1)
third = np.zeros(s+1)
forth = np.zeros(s+1)

mfirst = np.float64(0.0)
msecnd = np.float64(0.0)
mthird = np.float64(0.0)
mforth = np.float64(0.0)

for _ in range(0, N): # sampling with equal probabilities
    v = x1x2x3x4(rng, s, 0.25)
    q = v[0]
    mfirst += np.float64(q)
    first[q] += 1.0
    q = v[1]
    msecnd += np.float64(q)
    secnd[q] += 1.0
    q = v[2]
    mthird += np.float64(q)
    third[q] += 1.0
    q = v[3]
    mforth += np.float64(q)
    forth[q] += 1.0

x = np.arange(0, s+1, dtype=np.int32)
fig, axs = plt.subplots(4)
axs[0].stem(x, first, markerfmt=' ')
axs[1].stem(x, secnd, markerfmt=' ')
axs[2].stem(x, third, markerfmt=' ')
axs[3].stem(x, forth, markerfmt=' ')
plt.show()

print((mfirst/N, msecnd/N, mthird/N, mforth/N))
prints
(24.9267492, 25.0858356, 24.9928602, 24.994555)
NB! As I said, the first mean is lower and the second is higher. The histograms are a little bit different as well.
UPDATE III
Ok, Dirichlet, so be it. Let's compute the mean values of your generator before and after the filter. Code
import numpy as np

def generate(n=10000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)

a = generate(1000000)

print("Original Dirichlet sample means")
print(a.shape)
print(np.mean((a[:, 0] * 100).astype(int)))
print(np.mean((a[:, 1] * 100).astype(int)))
print(np.mean((a[:, 2] * 100).astype(int)))

print("\nFiltered Dirichlet sample means")
q = (a[(a[:,0]<=2*a[:,1]) & (a[:,2]>0.35),:] * 100).astype(int)
print(q.shape)
print(np.mean(q[:, 0]))
print(np.mean(q[:, 1]))
print(np.mean(q[:, 2]))
I've got
Original Dirichlet sample means
(1000000, 3)
32.833758
32.791228
32.88054
Filtered Dirichlet sample means
(281428, 3)
13.912784086871243
28.36360987535
56.23109285501087
Do you see the difference? As soon as you apply any kind of filter, you alter the distribution. Nothing is uniform anymore.
Ok, so I have this solution for my actual question. I generate 9000 triplets of random variables: for each triplet I sort two uniform random numbers, prepend a zero and append a one, and take the differences of consecutive entries, as suggested in the answer on SO I mentioned in my original question.
Then I simply filter out the triplets that don't match my constraints and plot the rest.
import numpy as np
import matplotlib.pyplot as plt

S = 100

def generate(n=9000):
    # differences of sorted uniforms padded with 0 and 1: a point on the simplex
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)

a = generate()

def plotter(a):
    fig = plt.figure(figsize=(10, 10), dpi=100)
    ax = fig.add_subplot(projection='3d')
    surf = ax.scatter(*zip(*a), marker='o', color=a / 100)
    ax.view_init(elev=25., azim=75)
    ax.set_xlabel('$A_1$', fontsize='large', fontweight='bold')
    ax.set_ylabel('$A_2$', fontsize='large', fontweight='bold')
    ax.set_zlabel('$A_3$', fontsize='large', fontweight='bold')
    lim = (0, S)
    ax.set_xlim3d(*lim)
    ax.set_ylim3d(*lim)
    ax.set_zlim3d(*lim)
    plt.show()

# keep only the samples that satisfy both constraints, then scale to S
b = a[(a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) &\
      (a[:, 1] >= (a[:, 2])),:] * S
plotter(b.astype(int))
As you can see, the samples are uniformly distributed over these arbitrary limits on the simplex, but I'm still not sure whether I could avoid throwing away samples that don't adhere to the constraints (by working the constraints into the generation process somehow; I'm almost certain now that it can't be done for general {Fi}). This would be useful in the general case where the constraints limit the sampled area to a very small subarea of the entire simplex: with rejection, if the constrained area is a fraction s of the simplex, you need on the order of 1/s draws from the simplex per accepted sample.
If someone has an answer to this last question I will be much obliged (and will change the selected answer to theirs).
I have an answer to my question. Under a general set of constraints, what I do is:
1. Sample the constraints in order to evaluate s, the constrained area.
2. If s is big enough, generate random samples and throw out those that do not comply with the constraints, as described in my previous answer.
3. Otherwise:
   a. Enumerate the entire simplex.
   b. Apply the constraints to filter out all tuples outside the constrained area.
   c. List the resulting filtered tuples.
   d. When asked to generate, choose uniformly from this result list.
   (Note: this is worth my effort only because I'm asked to generate very often.)
A combination of these two strategies should cover most cases.
Note: I also had to handle cases where S was a randomly generated parameter (m < S < M) in which case I simply treat it as another random variable constrained between m and M and I generate it together with the rest of the variables and handle it as I described earlier.
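The enumerate-and-filter branch above can be sketched in plain Python as follows. The function names (compositions, build_table, sample) are hypothetical, and the sketch assumes non-negative integers; per-variable bounds such as mi<=Xi<=Mi would simply go into the constraint function. It is only practical when the simplex is small enough to enumerate.

```python
import random
from itertools import combinations

def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    # Stars and bars: pick the positions of parts-1 bars among total+parts-1 slots;
    # the gaps between consecutive bars are the values.
    slots = total + parts - 1
    for bars in combinations(range(slots), parts - 1):
        edges = (-1,) + bars + (slots,)
        yield tuple(edges[i + 1] - edges[i] - 1 for i in range(parts))

def build_table(total, parts, constraint):
    """Enumerate the simplex once and keep only the tuples allowed by `constraint`."""
    return [t for t in compositions(total, parts) if constraint(t)]

def sample(table, rng=random):
    """Every surviving tuple is equally likely, so the draw is uniform on the region."""
    return rng.choice(table)

# Small instance of the example from the question: X1 <= 2*X2, with sum 20.
table = build_table(20, 4, lambda t: t[0] <= 2 * t[1])
draw = sample(table)
```

The table is built once and reused, which is what makes this worthwhile when samples are requested very often.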
I am trying to solve the following optimization problem using cvxpy:
x and delta_x are (1,N) row vectors, A is an (N,N) symmetric matrix, and b is a scalar. I am trying to find a y that minimizes the sum of squares of (y - delta_x) subject to the constraint (x+y).A.(x+y).T - b = 0. Below is my attempt to solve it.
x = np.reshape(np.ravel(x_data.T), (1, -1))
delta_x = np.reshape(np.ravel(delta.T), (1, -1))
y = cp.Variable(delta_x.shape)
objective = cp.Minimize(cp.sum_squares(y - delta_x))
constraints = [cp.matmul(cp.matmul(x + y, A), (x + y).T) == (b*b)]
prob = cp.Problem(objective, constraints)
result = prob.solve()
I keep getting the error 'cvxpy.error.DCPError: Problem does not follow DCP rules'.
I followed the rules stated in the answer here, but I don't understand how to construct the proper cvxpy minimization Problem. Any help would be greatly appreciated.
Thanks!
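For context on the error: the constraint is a quadratic equality, and DCP only accepts equality constraints in which both sides are affine, so cvxpy rejects it however it is written. One possible workaround, sketched below, is to hand the problem to a general nonlinear solver instead. This swaps cvxpy for scipy.optimize.minimize with the SLSQP method, which only finds a local solution. The helper name closest_on_quadric and the toy data are assumptions for illustration, and the constraint uses b as in the problem statement (the attempt above uses b*b).

```python
import numpy as np
from scipy.optimize import minimize

def closest_on_quadric(x, delta_x, A, b):
    """Locally minimize ||y - delta_x||^2 subject to (x+y) A (x+y)^T = b."""
    x = np.ravel(x)
    delta_x = np.ravel(delta_x)

    def objective(y):
        return np.sum((y - delta_x) ** 2)

    def constraint(y):
        v = x + y
        return v @ A @ v - b          # must be zero at a feasible point

    # Start from delta_x, the unconstrained minimizer of the objective.
    return minimize(objective, delta_x, method="SLSQP",
                    constraints=[{"type": "eq", "fun": constraint}])

# Toy data (assumption): A = I and b = 4, so x + y must lie on a circle of radius 2.
A = np.eye(2)
x = np.array([1.0, 0.0])
delta_x = np.array([0.5, 0.0])
res = closest_on_quadric(x, delta_x, A, b=4.0)
y_opt = res.x
```

In the toy case x + delta_x = (1.5, 0) is pulled out to the nearest point on the circle, (2, 0), so y_opt should come out close to (1, 0). For an indefinite A there can be several local solutions, so the starting point matters.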
I'm aware of the sum_expr function in the ompr package as a way to create a constraint with a dynamic sum. However, I'm wondering if there's a way to create a constraint that uses the product instead of the sum. Or is this not possible in linear optimisation?
For example:
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
n <- 20
score <- round(runif(n, 0, 25))
penalties <- round(runif(n, 0, 25))
model <- MIPModel() %>%
add_variable(x[i], i = 1:n, type = "binary") %>%
set_objective(sum_expr(score[i] * x[i], i = 1:n), "max") %>%
add_constraint(sum_expr(penalties[i] * x[i], i = 1:n) <= 100)
result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
result$solution
Instead of add_constraint(sum_expr()), is there a way of doing add_constraint(product_expr())?
If it's not possible with linear optimisation, where should I be looking instead?
The product of binary variables can be linearized as follows.
Suppose we want to model
y = prod(i, x(i))
x(i), y ∈ {0,1}
We can write this as a set of linear inequalities:
y ≤ x(i) ∀i
y ≥ sum(i, x(i)) - card(i) + 1
x(i), y ∈ {0,1}
where card(i) is the number of i's. Often things can be simplified further, but that depends on the details of the model.
This can be implemented straightforwardly in OMPR, and can be solved with any linear MIP solver.
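As a sanity check of the linearization above, a brute-force enumeration (in Python rather than R, purely for illustration) confirms that for every binary x the inequalities leave exactly one feasible y, namely the product of the x(i). The helper names feasible_y and linearization_is_exact are hypothetical.

```python
from itertools import product

def feasible_y(x):
    """Return the binary values of y allowed by the linear inequalities for a given x."""
    n = len(x)
    return [y for y in (0, 1)
            if all(y <= xi for xi in x)        # y <= x(i) for all i
            and y >= sum(x) - n + 1]           # y >= sum(i, x(i)) - card(i) + 1

def linearization_is_exact(n):
    """True if, for every binary x of length n, the only feasible y is prod(x)."""
    return all(feasible_y(x) == [int(all(x))]
               for x in product((0, 1), repeat=n))
```

The first family of inequalities forces y to 0 as soon as any x(i) is 0; the last inequality forces y to 1 only when all x(i) are 1.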
I managed to find an answer to my original question. In case it is of use to anyone, the constraint I needed was:
add_constraint(sum_expr(x[i] * log(penalties[i]), i = 1:n) >= log(100))
So in words, I'm summing the log transformed penalty values (for which x[i] = 1) against the log transformed penalty total, to mimic a product constraint.
My original question erroneously implied something that likely misled readers. I was looking for the product of all penalties for which x[i] = 1. Not the product of the values (penalties[i] * x[i]) which as soon as any x[i] = 0 becomes 0.
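The equivalence behind the log trick can be checked by brute force. The sketch below (Python, for illustration only; the function names and the sample penalties are hypothetical) compares the product test with the log-sum test over every binary selection. It assumes all penalties are strictly positive, since log(0) is undefined, and it uses a small tolerance so that floating-point rounding does not flip a case where the product equals the threshold exactly.

```python
import math
from itertools import product

def product_at_least(penalties, x, threshold):
    """Direct test: product of the selected penalties >= threshold."""
    selected = [p for p, xi in zip(penalties, x) if xi]
    return math.prod(selected) >= threshold     # empty selection has product 1

def log_sum_at_least(penalties, x, threshold, tol=1e-9):
    """Linear test: sum of x[i] * log(penalties[i]) >= log(threshold)."""
    total = sum(xi * math.log(p) for p, xi in zip(penalties, x))
    return total >= math.log(threshold) - tol

# Hypothetical integer penalties, all strictly positive.
penalties = [2, 5, 10, 4, 25, 3]
agree = all(product_at_least(penalties, x, 100) == log_sum_at_least(penalties, x, 100)
            for x in product((0, 1), repeat=len(penalties)))
```

Because the left-hand side is linear in x[i], the log form fits directly into a linear MIP, which the product form does not.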
EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)
Of course, because the outer loop runs in python instead of in native code inside the numpy library, it's not as fast as it could be. n and m are about 3-4 and several hundred thousand respectively, and I'm doing this somewhat often in an interactive program, so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M2(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
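A quick numerical check of this identity, as a sketch on made-up data (the random S, x and y below are assumptions for illustration). It uses np.linalg.solve for brevity, which does not exploit the triangular structure the way a dedicated triangular solver would.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.normal(size=(n, n))
S = B @ B.T + n * np.eye(n)          # symmetric positive definite by construction
x = rng.normal(size=n)
y = rng.normal(size=n)
diff = x - y

# Direct definition: (x - y)^T S^-1 (x - y), with an explicit inverse.
direct = diff @ np.linalg.inv(S) @ diff

# Cholesky route: S = L L^T, then || L^-1 (x - y) ||^2.
L = np.linalg.cholesky(S)            # lower-triangular factor
z = np.linalg.solve(L, diff)         # z = L^-1 (x - y), without forming S^-1
via_cholesky = z @ z
```

Both numbers agree up to floating-point rounding.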
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X[i, k] - U[j, k].
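To make the index bookkeeping concrete, here is a small self-contained sketch that compares the einsum expression with the double loop from the question on random data. The function names and the test data are made up for illustration; T holds the stacked inverse covariance matrices.

```python
import numpy as np

def mahalanobis2_loop(X, U, T):
    """Reference implementation: explicit double loop over points and centers."""
    m, k = X.shape[0], U.shape[0]
    D = np.zeros((m, k))
    for j in range(k):
        for i in range(m):
            diff = X[i] - U[j]
            D[i, j] = diff @ T[j] @ diff
    return D

def mahalanobis2_einsum(X, U, T):
    """Vectorized version: diff[j, i, :] = X[i] - U[j], contracted against T[j]."""
    diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]      # shape (k, m, n)
    return np.einsum('jik,jkl,jil->ij', diff, T, diff)    # shape (m, k)

rng = np.random.default_rng(0)
m, k, n = 50, 5, 3
X = rng.normal(size=(m, n))
U = rng.normal(size=(k, n))
B = rng.normal(size=(k, n, n))
T = B @ B.transpose(0, 2, 1) + np.eye(n)   # k symmetric positive definite matrices
D_loop = mahalanobis2_loop(X, U, T)
D_fast = mahalanobis2_einsum(X, U, T)
```

The indices k and l are summed away because they do not appear after the arrow, while j ties each center to its own matrix.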
Dougal nailed this one with an excellent and detailed answer, but I thought I'd share a small modification that I found increases efficiency, in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
    # Cholesky decomposition of inverse of covariance matrix
    # (Doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))
    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)
    # Same as above, but note different index at end because the matrix
    # y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)
Ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer but adding here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
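For reference, the relation is log p(x) = -(1/2) * [ d*log(2*pi) + log|S| + M2(x, u; S) ], and with S = L L^T the log-determinant is 2 * sum(log(diag(L))). A minimal sketch of a density built on top of the Cholesky-based distance follows; mvn_logpdf is a hypothetical name and this is not the benchmarked code. It uses np.linalg.solve where the methods above use a triangular solve.

```python
import numpy as np

def mvn_logpdf(X, mu, sigma):
    """Log-density of N(mu, sigma) at each row of X via the squared Mahalanobis distance."""
    d = mu.shape[0]
    L = np.linalg.cholesky(sigma)                  # sigma = L @ L.T
    z = np.linalg.solve(L, (X - mu).T)             # column i is L^-1 (x_i - mu)
    m2 = np.einsum('ij,ij->j', z, z)               # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log|sigma| from the Cholesky factor
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + m2)

# Standard normal in two dimensions: the density at the origin is 1 / (2*pi).
p0 = np.exp(mvn_logpdf(np.zeros((1, 2)), np.zeros(2), np.eye(2)))[0]
```

Looping this over the k distributions gives the n by k matrix of densities described above.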
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy 2 is Dougal's 2nd solution. Since Fancy 2 needs to calculate the inverses of all the covariance matrices, I also tried a "cheating version" where these were passed in as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation, but it was so slow it would have taken all day.
What we can take away from this is that using Dougal's first method is about 20% faster than however Scipy does it. Unfortunately despite its cleverness the 2nd method is only about 60% as fast as the first. There are probably some other optimizations that can be done but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. Also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable, I did not actually check the results).