Matrix factorization with ALS has very slow performance with implicit zeros for large matrices in R

I want to perform a matrix factorization with alternating least squares (ALS) in R. While the code works fine for small matrices, it is incredibly slow for larger matrices. I would appreciate any help in speeding up the process. I am using RRopen 8.01, so it is already running on multiple cores via MKL.
I am using a binary matrix as the implicit feedback matrix. Furthermore, I implemented a weighting matrix.
## Matrix Factorization with Alternating Least Squares
## R is u * v binary matrix,
## W is u * v weighting matrix
## U is u * k user feature matrix,
## V is k * v item feature matrix
## u is the number of users,
## v is the number of items,
## k is the number of features
## iter is the number of iterations
Here is what I did:
# implicit feedback data matrix.
R <- matrix(nr=2, nc=5, data=rbinom(2*5,1, prob=.2))
W <- matrix(nr=2, nc=5, data=rbinom(2*5,7, prob=.2))
I set the following parameter:
k <- 20
its <- 10
Create the initial matrices for users and items
# initial users matrix.
U <- matrix(nr= nrow(R), nc=k, data=5 *rnorm(nrow(R)*k))
# initial items matrix.
V <- matrix(nr=k, nc=ncol(R), data=5* rnorm(ncol(R)*k))
And now I perform the Matrix Factorization with ALS
w.err <- NULL
for(iter in 1:its) {
  # update users
  for(i in 1:nrow(R)) {
    U[i,] <- t(solve(V %*% (diag(R[i,]) %*% t(V)) + 0.1 * diag(k),
                     as.vector(V %*% as.vector(t(W[i,]) %*% diag(R[i,])))))
  }
  # update items
  for(j in 1:ncol(R)) {
    V[,j] <- solve(t(U) %*% (diag(R[,j]) %*% U) + 0.1 * diag(k),
                   t(U) %*% (diag(R[,j]) %*% W[, j]))
  }
  R.hat <- U %*% V
  w.err[iter] <- sum((R * (W - U %*% V))^2)
}
R.hat is the desired end matrix.
w.err is just a control for the errors over the iterations. Nice for plotting :)
The code as it is works fine. But when I increase the number of rows and columns in R (and W), performance decreases significantly. While it is fine for, let's say, nr=200, nc=500, it has already been running for two hours for nr=2000, nc=5000 (and has not finished yet) on an 8-core 2.67 GHz machine.
I didn't use the NMF or the NMFN package since negative values are possible; accordingly, this is not a non-negative MF.
Does anyone have an idea how to increase performance? Maybe I am just stupid and my code is nonsense; I would be happy if you could point out improvements.
I looked for similar questions but couldn't find one. Maybe I just overlooked it.
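One likely bottleneck in the loops above is diag(R[i, ]): it builds a dense v x v matrix for every user (and a u x u one for every item), although R is binary and the product only selects columns of V. Here is a minimal sketch of the user update that subsets V instead, assuming R contains only 0/1 as described. The names update_user_dense, update_user_subset and lambda, and the small example data, are illustrative and not part of the original code.

```r
set.seed(1)
k <- 3
lambda <- 0.1  # regularisation weight, 0.1 in the question

# Small illustrative data: 4 users, 6 items, every user has at least one 1
R <- matrix(c(1,0,1,0, 0,1,1,0, 1,1,0,1, 0,0,1,1, 1,0,0,1, 0,1,0,0), nrow = 4)
W <- matrix(runif(4 * 6), nrow = 4)
V <- matrix(rnorm(k * ncol(R)), nrow = k)

# Formulation from the question: builds a v x v diagonal matrix per user
update_user_dense <- function(i) {
  as.vector(solve(V %*% (diag(R[i, ]) %*% t(V)) + lambda * diag(k),
                  as.vector(V %*% as.vector(t(W[i, ]) %*% diag(R[i, ])))))
}

# Same linear system, using only the columns of V where R[i, ] == 1
update_user_subset <- function(i) {
  idx <- which(R[i, ] == 1)
  Vi <- V[, idx, drop = FALSE]
  as.vector(solve(tcrossprod(Vi) + lambda * diag(k), Vi %*% W[i, idx]))
}

# Both formulations should agree up to floating-point error
all.equal(update_user_dense(1), update_user_subset(1))
```

The item update can be rewritten the same way with U[idx, , drop = FALSE] and crossprod().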

Related

compact/efficient replacement for diag(X V X^T)?

When making predictions for a linear statistical model we usually have a model matrix X of predictors corresponding to the points at which we want to make predictions; a vector of coefficients beta; and a variance-covariance matrix V. Computing the predictions is just X %*% beta. The most straightforward way to compute the variances of the predictions is
diag(X %*% V %*% t(X))
or slightly more efficiently
diag(X %*% tcrossprod(V,X))
However, this is very inefficient, because it constructs an n*n matrix when all we really want is the diagonal. I know I could write some Rcpp-loopy thing that would compute just the diagonal terms, but I'm wondering if there is an existing linear algebra trick in R that will nicely do what I want ... (if someone wants to write the Rcpp-loopy thing for me as an answer I wouldn't object, but I'd prefer a pure-R solution)
FWIW predict.lm seems to do something clever by multiplying X by the inverse of the R component of the QR-decomposition of the lm; I'm not sure that's always going to be available, but it might be a good starting point (see here)

Along the lines of this Octave/Matlab question, for two matrices A and B, we can use the fact that the nth diagonal entry of AB will be the product of the nth row of A with the nth column of B. We can naively extend that to the case of three matrices, ABC. I have not considered how to optimize in the case where C=A^T, but aside from that, this code looks like a promising speedup:
start_time <- Sys.time()
A=matrix(1:1000000, nrow = 1000, ncol = 1000)
B=matrix(1000000:1, nrow = 1000, ncol = 1000)
# Try one of these two
res=diag(A %*% B %*% t(A)) # ~0.47s
res=rowSums(A * t(B %*% t(A))) # ~0.27s
end_time <- Sys.time()
print(end_time - start_time)
Using tcrossprod did not appear to accelerate the results when I ran this code. However, just using the row-sum-dot-product approach appears to be a lot more efficient already, at least on this silly example. The n*n intermediate B %*% t(A) is still formed, so the gain presumably comes from replacing the second matrix product with an elementwise product and rowSums, rather than computing every entry of the full product as the diag version does.
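For the original case C = A^T with a tall model matrix, the same row-wise dot-product idea can be arranged so that no n*n matrix is formed at all. This is a minimal sketch, assuming X is n x p with p much smaller than n and V is p x p; the data are made up for illustration.

```r
set.seed(42)
n <- 200
p <- 4
X <- matrix(rnorm(n * p), nrow = n)   # illustrative model matrix
V <- crossprod(matrix(rnorm(p * p), nrow = p))  # illustrative symmetric p x p matrix

full <- diag(X %*% V %*% t(X))   # forms an n x n matrix, keeps only the diagonal
fast <- rowSums((X %*% V) * X)   # only n x p intermediates

all.equal(full, fast)
```

Row i of (X %*% V) * X sums to x_i^T V x_i, which is exactly the ith diagonal entry of X V X^T.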

I am not quite sure how efficient this is:
Find U such that V = U %*% t(U); this is possible because V is a covariance matrix.
XU = X %*% U
result = apply(XU, 1, function(x) sum(x^2))
Demo
V <- cov(iris[, -5])
X <- as.matrix(iris[1:5, -5])
Using SVD
svd_v <- svd(V)
U <- svd_v$u %*% diag(sqrt(svd_v$d))
XU = X %*% U
apply(XU, 1, function(x) sum(x^2))
# 1 2 3 4 5
#41.35342 39.36286 35.42369 38.25584 40.30839
Another approach (this also isn't going to be faster than @davewy's):
U <- t(chol(V)) # chol() returns the upper factor, so transpose to get V = U %*% t(U)
XU = (X %*% U)^2
rowSums(XU)

I recently found emulator::quad.diag(), which is just
colSums(crossprod(M, Conj(x)) * x)
This is slightly better than @davewy's solution (although the overall differences are less than I thought they would be anyway).
library(emulator)
library(microbenchmark)
microbenchmark(full=diag(A %*% B %*% t(A)),
davewy=rowSums(A * t(B %*% t(A))),
emu = quad.diag(A,B))
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   full 32.76241 35.49665 39.51683 37.63958 41.46561 57.41370   100   c
 davewy 22.74787 25.06874 28.42179 26.97330 29.68895 45.38188   100  b
    emu 17.68390 20.21322 23.59981 22.09324 24.80734 43.60953   100 a

How do I minimize a linear least squares function in R?

I'm reading Deep Learning by Goodfellow et al. and am trying to implement gradient descent as shown in Section 4.5 Example: Linear Least Squares. This is page 92 in the hard copy of the book.
The algorithm can be viewed in detail at https://www.deeplearningbook.org/contents/numerical.html with R implementation of linear least squares on page 94.
I've tried implementing in R, and the algorithm as implemented converges on a vector, but this vector does not seem to minimize the least squares function as required. Adding epsilon to the vector in question frequently produces a "minimum" less than the minimum outputted by my program.
options(digits = 15)
dim_square = 2 ### set dimension of square matrix
# Generate a random matrix A and a constant vector b
set.seed(1234)
A = matrix(nrow = dim_square, ncol = dim_square, byrow = T, rlnorm(dim_square ^ 2)/10)
b = rep(rnorm(1), dim_square)
# having fixed A and b, select x randomly
x = rnorm(dim_square) # vector length of dim_square--supposed to be arbitrary
f = function(x, A, b){
total_vector = A %*% x + b # this is the function that we want to minimize
total = 0.5 * sum(abs(total_vector) ^ 2) # L2 norm squared
return(total)
}
f(x,A,b)
# how close do we want to get?
epsilon = 0.1
delta = 0.01
value = (t(A) %*% A) %*% x - t(A) %*% b
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
steps = vector()
while(L2_norm > delta){
x = x - epsilon * value
value = (t(A) %*% A) %*% x - t(A) %*% b
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
print(L2_norm)
}
minimum = f(x, A, b)
minimum
minimum_minus = f(x - 0.5*epsilon, A, b)
minimum_minus # less than the minimum found by gradient descent! Why?
Following page 94 of the PDF at https://www.deeplearningbook.org/contents/numerical.html, I am trying to find the vector x that minimizes f(x). However, as the values of minimum and minimum_minus in my code show, minimum is not the actual minimum, as it exceeds minimum_minus.
Any idea what the problem might be?

Original Problem
Finding the value of x that minimizes the norm of Ax - b is, for an invertible A, equivalent to finding the value of x such that Ax - b = 0, or x = (A^-1)*b. This is because the L2 norm is the Euclidean norm, more commonly known as the distance formula. By definition, distance cannot be negative, so its minimum is zero whenever that value is attainable.
This algorithm, as implemented, actually comes quite close to estimating x. However, because of recursive subtraction and rounding one quickly runs into the problem of underflow, resulting in massive oscillation, below:
[Figure: Value of L2 norm as a function of step size]
[Figure: Above algorithm vs. solve function in R]
Above we have the results of A %*% x followed by A %*% min_x, with x estimated by the implemented algorithm and min_x estimated by the solve function in R.
The problem of underflow, well known to those familiar with numerical analysis, is probably best tackled by the programmers of lower-level libraries best equipped to tackle it.
To summarize, the algorithm appears to work as implemented. Important to note, however, is that not every function will have a minimum (think of a straight line), and also be aware that this algorithm should only be able to find a local, as opposed to a global minimum.
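One detail of the question's code may also be worth checking: f evaluates A %*% x + b, whereas the update direction t(A) %*% A %*% x - t(A) %*% b is the gradient of 0.5 * ||A x - b||^2, so the function being evaluated and the function being descended are not the same. Below is a minimal sketch with a matching objective and gradient, assuming the intended objective is 0.5 * ||A x - b||^2. The matrix A, the vector b and the tolerance are made-up example values, chosen to be well conditioned so that the loop terminates quickly.

```r
# Assumed example data (not from the question): small, well-conditioned system
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)

f    <- function(x) 0.5 * sum((A %*% x - b)^2)            # objective
grad <- function(x) as.vector(crossprod(A, A %*% x - b))  # gradient of the same objective

x <- c(0, 0)
epsilon <- 0.1   # step size; must stay below 2 / (largest eigenvalue of t(A) %*% A)
delta <- 1e-8    # stop once the gradient norm falls below this
g <- grad(x)
while (sqrt(sum(g^2)) > delta) {
  x <- x - epsilon * g
  g <- grad(x)
}

x_exact <- solve(A, b)   # direct solution for comparison
all.equal(x, x_exact, tolerance = 1e-6)
```

With objective and gradient aligned, perturbing the result (for example f(x - 0.05)) should no longer produce a smaller value than f(x).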

Solve indeterminate equation system in R

I have an equation system and I want to solve it using numerical methods. I want to get a close solution given a starting seed. Let me explain.
I have a vector of constants, X:
X <- (c(1,-2,3,4))
and a vector W of weights:
W <- (c(0.25,0.25,0.25,0.25))
I want the components of W to sum to 1 (sum(W)=1), and the element-by-element product of X and W to sum to a given number N (sum(W*X)=N).
Is there an easy way to do this in R? I have it in Excel, using Solver, but I need to automate it.

Here is your constant and your target value:
x <- c(1, -2, 3, 4)
n <- 10
You need a function to minimize. The first line contains each of your conditions, and the second line provides a measure of how to combine the errors into a single score. You may want to change the second line. For example, you could make one error term be more heavily weighted than the other using sum(c(1, 5) * errs ^ 2).
fn <- function(w)
{
errs <- c(sum(w) - 1, sum(x * w) - n)
sum(errs ^ 2)
}
The simplest thing is to start with all the weights the same value.
init_w <- rep.int(1 / length(x), length(x))
Use optim to optimize.
optim(init_w, fn)
## $par
## [1] 0.1204827 -1.2438883 1.1023338 1.0212406
##
## $value
## [1] 7.807847e-08
##
## $counts
## function gradient
## 111 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The par element contains your weights.

There is no unique solution for this problem. If you try other initial values for w you will most likely get different results from optim.
The problem can be formulated as solving an underdetermined system of linear equations.
A <- matrix(c(rep(1,4),x), nrow=2,byrow=TRUE)
b <- matrix(c(1,n), nrow=2)
We seek a solution that satisfies A %*% w = b but which one? Minimum norm solution? Or maybe some other one? There are infinitely many solutions. Solutions can be given using the pseudo-inverse of the matrix A. Use package MASS for this.
library(MASS)
Ag <- ginv(A)
The minimum norm solution is
wmnorm <- Ag %*% b
And check with A %*% wmnorm - b and fn(wmnorm).
See the Wikipedia page "System of linear equations", section "Matrix solutions".
The solutions are given by
Az <- diag(nrow=nrow(Ag)) - Ag %*% A
w <- wmnorm + Az %*% z
where z is an arbitrary vector of ncol(Az) elements.
And now generate some solutions and check
xb <- wmnorm
z <- runif(4)
wsol.2 <- xb + Az %*% z
wsol.2
A %*% wsol.2 - b
fn(wsol.2)
z <- runif(4)
wsol.3 <- xb + Az %*% z
wsol.3
A %*% wsol.3 - b
fn(wsol.3)
And you'll see that these two solutions are valid solutions when given as argument to fn. And are quite different from the solution found by optim. You could test this by choosing a different starting point init_w for example by init_w1 <- runif(4)/4.
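The minimum norm solution can also be written without the pseudo-inverse. A minimal sketch in base R, assuming A has full row rank (true for this x, since x is not constant), uses the closed form w = t(A) (A t(A))^-1 b; the variable name w_min is illustrative.

```r
x <- c(1, -2, 3, 4)
n <- 10

A <- rbind(rep(1, length(x)), x)   # row 1: sum(w) = 1, row 2: sum(x * w) = n
b <- c(1, n)

# Minimum-norm solution of the underdetermined system A w = b
w_min <- as.vector(crossprod(A, solve(tcrossprod(A), b)))

as.vector(A %*% w_min) - b   # both constraints, should be numerically zero
```

This should agree with ginv(A) %*% b from the answer above, because the pseudo-inverse of a full-row-rank matrix has exactly this form.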

Vectorizing code to calculate (squared) Mahalanobis Distance

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)
Of course, because the outer loop runs in python instead of in native code inside the numpy library, it's not as fast as it could be. n and m are about 3-4 and several hundred thousand respectively and I'm doing this somewhat often in an interactive program so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M2(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)

First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
import numpy as np
import scipy.linalg
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X[i, k] - U[j, k].
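Since the question also asked about R, the Cholesky-based derivation from this answer can be sketched there as well. This is a rough translation rather than a benchmarked implementation; X, mu and S below are made-up example data, and the built-in stats::mahalanobis() is used only as a reference for checking.

```r
set.seed(1)
X  <- matrix(rnorm(50 * 3), ncol = 3)   # m x n data, one point per row
mu <- c(0.5, -1, 2)
S  <- matrix(c(2.0, 0.5, 0.2,
               0.5, 1.0, 0.3,
               0.2, 0.3, 1.5), nrow = 3)  # symmetric positive definite

R <- chol(S)                          # upper triangular, S = t(R) %*% R
# Solve t(R) %*% z_i = x_i - mu for all points at once (one column per point)
Z <- forwardsolve(t(R), t(X) - mu)
d2 <- colSums(Z^2)                    # squared Mahalanobis distances

all.equal(as.vector(d2), as.vector(mahalanobis(X, mu, S)))
```

Note that chol() in R returns the upper factor, so t(R) plays the role of the lower factor L in the derivation.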

Dougal nailed this one with an excellent and detailed answer, but I thought I'd share a small modification that I found increases efficiency, in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
    # Cholesky decomposition of inverse of covariance matrix
    # (Doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))
    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)
    # Same as above, but note different index at end because the matrix
    # y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)
Ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer but adding here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy2 is Dougal's 2nd solution. Since the Fancy 2 needs to calculate the inverses of all the covariance matrices I also tried a "cheating version" where it was passed these as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation but that was so slow it would have taken all day.
What we can take away from this is that using Dougal's first method is about 20% faster than however Scipy does it. Unfortunately despite its cleverness the 2nd method is only about 60% as fast as the first. There are probably some other optimizations that can be done but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. Also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable, I did not actually check the results).

Vectorize function to avoid loop

I'm trying to speed up my code because it's running very long. I already found out where the problem lies. Consider the following example:
x<-c((2+2i),(3+1i),(4+1i),(5+3i),(6+2i),(7+2i))
P<-matrix(c(2,0,0,3),nrow=2)
out<-sum(c(0.5,0.5)%*%mtx.exp(P%*%(matrix(c(x,0,0,x),nrow=2)),5))
I have a vector x with complex values; the vector has 12^11 entries, and I want to calculate the sum in the third line. I need the function mtx.exp (from the package Biodem) because it's a complex matrix power; I found out that the %^% function does not support complex arguments.
So my problem is that if I try
sum(c(0.5,0.5)%*%mtx.exp(P%*%(matrix(c(x,0,0,x),nrow=2)),5))
I get an error: "Error in pot %*% pot : non-conformable arguments." So my solution was to use a loop:
tmp<-NULL
for (i in 1:length(x)){
tmp[length(tmp)+1]<-sum(c(0.5,0.5)%*%mtx.exp(P%*%matrix(c(x[i],0,0,x[i]),nrow=2),5))
}
But as said, this takes very long. Do you have any ideas how to speed up the code? I also tried sapply but that takes just as long as the loop.
I hope you can help me, because I have to run this function approximately 500 times and the first try took more than 3 hours, which is not very satisfying.
Thank you very much

The code can be sped up by pre-allocating your vector,
tmp <- rep(NA,length(x))
but I do not really understand what you are trying to compute:
in the first example,
you are trying to take the power of a non-square matrix,
in the second, you are taking the power of a diagonal matrix
(which can be done with ^).
The following seems to be equivalent to your computations:
sum(P^5/2) * x^5
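A quick way to check that closed form against the loop is sketched below. To keep it free of the Biodem dependency, mat_pow is a hypothetical stand-in for mtx.exp that simply multiplies the matrix by itself; it is an assumption for illustration, not the package's implementation.

```r
x <- c(2+2i, 3+1i, 4+1i, 5+3i, 6+2i, 7+2i)
P <- matrix(c(2, 0, 0, 3), nrow = 2)

# Hypothetical stand-in for Biodem::mtx.exp: n-th power by repeated %*%
mat_pow <- function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))

# Element-by-element computation, as in the question's loop
loop <- vapply(x, function(u) {
  sum(c(0.5, 0.5) %*% mat_pow(P %*% matrix(c(u, 0, 0, u), nrow = 2), 5))
}, complex(1))

# Vectorized closed form (valid because P is diagonal and C = u * identity)
closed <- sum(P^5 / 2) * x^5

all.equal(loop, closed)
```

The closed form works on the whole vector at once, so it avoids both the loop and the per-element matrix products.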
EDIT
If P is not diagonal and C not scalar,
I do not see any easy simplification of mtx.exp( P %*% C, 5 ).
You could try something like
y <- sapply(x, function(u)
sum(
c(0.5,0.5)
%*%
mtx.exp( P %*% matrix(c(u,0,0,u),nrow=2), 5 )
)
)
but if your vector really has 12^11 entries,
that will take an insanely long time.
Alternatively, since you have a very large number
of very small (2*2) matrices,
you can explicitly compute the product P %*% C
and its 5th power (using some computer algebra system:
Maxima, Sage, Yacas, Maple, etc.)
and use the resulting formulas:
these are just (50 lines of) straightforward operations on vectors.
/* Maxima code */
p: matrix([p11,p12], [p21,p22]);
c: matrix([c1,0],[0,c2]);
display2d: false;
factor(p.c . p.c . p.c . p.c . p.c);
I then copy and paste the result in R:
c1 <- dnorm(abs(x),0,1); # C is still a diagonal matrix
c2 <- dnorm(abs(x),1,3);
p11 <- P[1,1]
p12 <- P[1,2]
p21 <- P[2,1]
p22 <- P[2,2]
# Result of the Maxima computations:
# I just add all the elements of the resulting 2*2 matrix,
# but you may want to do something slightly different with them.
c1*(c2^4*p12*p21*p22^3+2*c1*c2^3*p11*p12*p21*p22^2
+2*c1*c2^3*p12^2*p21^2*p22
+3*c1^2*c2^2*p11^2*p12*p21*p22
+3*c1^2*c2^2*p11*p12^2*p21^2
+4*c1^3*c2*p11^3*p12*p21+c1^4*p11^5)
+
c2*p12
*(c2^4*p22^4+c1*c2^3*p11*p22^3+3*c1*c2^3*p12*p21*p22^2
+c1^2*c2^2*p11^2*p22^2+4*c1^2*c2^2*p11*p12*p21*p22
+c1^3*c2*p11^3*p22+c1^2*c2^2*p12^2*p21^2
+3*c1^3*c2*p11^2*p12*p21+c1^4*p11^4)
+
c1*p21
*(c2^4*p22^4+c1*c2^3*p11*p22^3+3*c1*c2^3*p12*p21*p22^2
+c1^2*c2^2*p11^2*p22^2+4*c1^2*c2^2*p11*p12*p21*p22
+c1^3*c2*p11^3*p22+c1^2*c2^2*p12^2*p21^2
+3*c1^3*c2*p11^2*p12*p21+c1^4*p11^4)
+
c2*(c2^4*p22^5+4*c1*c2^3*p12*p21*p22^3
+3*c1^2*c2^2*p11*p12*p21*p22^2
+3*c1^2*c2^2*p12^2*p21^2*p22
+2*c1^3*c2*p11^2*p12*p21*p22
+2*c1^3*c2*p11*p12^2*p21^2+c1^4*p11^3*p12*p21)