Interpretation/Mechanics of eigenvectors/eigenvalues of covariance matrix (PCA) - linear-algebra

1st Q: Can someone explain the connection between the covariance and the eigenvectors?
2nd Q: How do dependencies in the data affect my PCA, and how is the "best" component then chosen?
This could happen with easily detectable things like height, weight and BMI, or with less obvious things.
Maybe someone can recommend a video or explain clearly how to think about the eigenvectors extracted from the covariance matrix.
I understand them as the characteristic directions that stay the same (up to scaling) when the matrix transformation is applied.
But here we don't apply any transformation, so how should I interpret them?
I have read so far that an eigenvalue is the "amount" of space that its eigenvector spans/explains. 3rd Q: Is this wrong?
# Testing something for us !
import numpy as np
import pandas as pd
x = np.array((range(1,11)))
y = np.array((range(20,220,20)))
z = x*y
data = np.array([x,y,z])
test_matrix = np.cov(data)
#print(test_matrix)
eigenval, eigenvect = np.linalg.eig(test_matrix)
print(eigenval)
#sorting
sorting_indices = eigenval.argsort()[::-1] # largest eigenvalue first
eigenval = eigenval[sorting_indices]
eigenvect = eigenvect[:, sorting_indices] # eigenvectors are the columns, so reorder the columns
print(eigenval)
print(eigenvect)
# mapping
result = np.dot(eigenvect.transpose(), data)
# first component explanation amount
first_comp_perc =eigenval[0]/sum(eigenval)
print(f'The first component makes up for {round(first_comp_perc*100, 2)}% of the data variance')
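One way to look at the connection asked about above, as a minimal sketch rather than a full answer: the covariance matrix is symmetric, so it has orthogonal eigenvectors, and the variance of the centered data projected onto an eigenvector equals the corresponding eigenvalue. The toy data below follows the question (the names centered, scores and proj_var are only illustrative). Because y is an exact linear function of x, one eigenvalue should come out as numerically zero, which is how such a dependency shows up.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 20.0 * x                 # exact linear function of x
z = x * y                    # nonlinear in x, so not perfectly correlated with it
data = np.array([x, y, z])   # rows are variables, columns are observations

cov = np.cov(data)           # 3x3 covariance matrix (divisor n-1)

# eigh is intended for symmetric matrices; sort so the largest eigenvalue comes first
eigenval, eigenvect = np.linalg.eigh(cov)
order = np.argsort(eigenval)[::-1]
eigenval = eigenval[order]
eigenvect = eigenvect[:, order]      # eigenvectors are the columns

# project the centered data onto each eigenvector
centered = data - data.mean(axis=1, keepdims=True)
scores = eigenvect.T @ centered

# variance along each eigenvector, to compare with the eigenvalues
proj_var = scores.var(axis=1, ddof=1)
print(eigenval)
print(proj_var)
print(eigenval / eigenval.sum())     # share of total variance per component
```

The two printed vectors should agree up to rounding, and the last line gives the "explained variance" fractions that the question computes for the first component.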


PCA scores for only the first principal components are of "wrong" sign

I am currently trying to get into principal component analysis and regression. I therefore tried calculating the principal components of a given matrix by hand and comparing them with the results you get out of the r-package rcomp.
The following is the code for doing pca by hand
### compute principal component loadings and scores by hand
df <- matrix(nrow = 5, ncol = 3, c(90,90,60,60,30,
60,90,60,60,30,
90,30,60,90,60))
# calculate covariance matrix to see variance and covariance of
cov.mat <- cov.wt(df)
cen <- cov.mat$center
n.obs <- cov.mat$n.obs
cv <- cov.mat$cov * (1-1/n.obs)
## calculate the eigenvectors and eigenvalues
edc <- eigen(cv, symmetric = TRUE)
ev <- edc$values
evec <- edc$vectors
cn <- paste0("Comp.", 1L:ncol(cv))
cen <- cov.mat$center
### get loadings (or principal component weights) out of the eigenvectors and compute scores
loadings <- structure(edc$vectors, class = "loadings")
df.scaled <- scale(df, center = cen, scale = FALSE)
scr <- df.scaled %*% evec
I compared my results to the ones obtained by using the princomp-package
pca.mod <- princomp(df)
loadings.mod <- pca.mod$loadings
scr.mod <- pca.mod$scores
scr
scr.mod
> scr
[,1] [,2] [,3]
[1,] -6.935190 32.310906 7.7400588
[2,] -48.968014 -19.339313 -0.3529382
[3,] 1.733797 -8.077726 -1.9350147
[4,] 13.339605 18.519500 -9.5437444
[5,] 40.829802 -23.413367 4.0916385
> scr.mod
Comp.1 Comp.2 Comp.3
[1,] 6.935190 32.310906 7.7400588
[2,] 48.968014 -19.339313 -0.3529382
[3,] -1.733797 -8.077726 -1.9350147
[4,] -13.339605 18.519500 -9.5437444
[5,] -40.829802 -23.413367 4.0916385
So apparently I did quite well. The computed scores agree at least in magnitude. However, the scores for the first principal component differ in sign. This is not the case for the other two.
This leads to two questions:
I have read that it is no problem to multiply the loadings and the scores of principal components by minus one. Does this also hold when only one of the principal components has a different sign?
What am I doing "wrong" from a computational standpoint? The procedure seems straightforward to me and I don't see what I could change in my own calculations to get the same signs as the princomp-package.
When checking this with the mtcars data set, the signs for my first PC were right, however now the second and fourth PC scores are of different signs, compared to the package. I can not make any sense of this. Any help is appreciated!
The signs of eigenvectors and loadings are arbitrary, so there is nothing "wrong" here. The only thing that you should expect to be preserved is the overall pattern of signs within each loadings vector, i.e. in the example above the princomp answer for PC1 gives +,+,-,-,- while yours gives -,-,+,+,+. That's fine. If yours gave e.g. -,+,-,-,+ that would be trouble (because the two would no longer be equivalent up to multiplication by -1).
However, while it's generally true that the signs are arbitrary and hence could vary across algorithms, compilers, operating systems, etc., there's an easy solution in this particular case. princomp has a fix_sign argument:
fix_sign: Should the signs of the loadings and scores be chosen so that
the first element of each loading is non-negative?
Try princomp(df,fix_sign=FALSE)$scores and you'll see that the signs (probably!) line up with your results. (In general the fix_sign=TRUE option is useful because it breaks the symmetry in a specific way and thus will always result in the same answers across all platforms.)
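To illustrate why the sign is arbitrary, here is a small sketch in Python with NumPy (not R). The helper name fix_sign_first_nonneg is made up for this example and only mimics the convention described in the quoted documentation; it is not princomp's actual code. If v is an eigenvector then so is -v, and flipping a loading vector flips the matching score column while the reconstructed data stays the same.

```python
import numpy as np

# the 5x3 matrix from the question (R fills matrices column by column)
df = np.array([[90, 60, 90],
               [90, 90, 30],
               [60, 60, 60],
               [60, 60, 90],
               [30, 30, 60]], dtype=float)

centered = df - df.mean(axis=0)
cov = centered.T @ centered / df.shape[0]   # divisor n, like cv in the question

eigenval, eigenvect = np.linalg.eigh(cov)
order = np.argsort(eigenval)[::-1]          # largest eigenvalue first
eigenval, eigenvect = eigenval[order], eigenvect[:, order]

def fix_sign_first_nonneg(vectors):
    """Hypothetical helper: flip each column so that its first entry is >= 0."""
    signs = np.where(vectors[0, :] < 0, -1.0, 1.0)
    return vectors * signs

flipped = -eigenvect                 # an equally valid set of eigenvectors
scores_a = centered @ eigenvect
scores_b = centered @ flipped        # same scores with every sign reversed

# both sign choices reconstruct the same centered data
print(np.allclose(scores_a @ eigenvect.T, scores_b @ flipped.T))
# after applying one fixed convention, both versions coincide
print(np.allclose(fix_sign_first_nonneg(eigenvect), fix_sign_first_nonneg(flipped)))
print(np.abs(scores_a))              # magnitudes to compare with the scores in the question
```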

Generate random natural numbers that sum to a given number and comply to a set of general constraints

I had an application that required something similar to the problem described here.
I too need to generate a set of positive integer random variables {Xi} that add up to a given sum S, where each variable might have constraints such as mi<=Xi<=Mi.
This I know how to do. The problem is that in my case I might also have constraints between the random variables themselves, say Xi<=Fi(Xj) for some given Fi (let's also say Fi's inverse is known). Now, how should one generate the random variables "correctly"? I put "correctly" in quotes because I'm not really sure what it would mean here, except that I want the generated numbers to cover all possible cases with as uniform a probability as possible for each case.
Say we even look at a very simple case:
4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, what would be the "correct" way to generate them?
P.S. I know that this seems like it would be a better fit for math overflow but I found no solutions there either.
For 4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, one could use the multinomial distribution.
As long as the probability of the first number is low enough, your condition will almost always be satisfied; if not, reject and repeat.
And the multinomial distribution by design has the sum equal to 100.
Code, Windows 10 x64, Python 3.8
import numpy as np
def x1x2x3x4(rng):
    # rejection sampling: redraw until the constraint X1 <= 2*X2 holds
    while True:
        v = rng.multinomial(100, [0.1, 1/2-0.1, 1/4, 1/4])
        if v[0] <= 2*v[1]:
            return v
rng = np.random.default_rng()
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
UPDATE
There is a lot of freedom in selecting the probabilities. E.g., you could make the others (#2, 3, 4) symmetric. Code
def x1x2x3x4(rng, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(100, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
UPDATE II
If you start rejecting combinations, then you artificially bump the probabilities of one subset of events and lower the probabilities of another subset, while the total sum is always 1. There is NO WAY to have uniform probabilities with the conditions you want to meet. The code below runs the multinomial with equal probabilities and computes histograms and mean values. Each mean is supposed to be exactly 25 (=100/4), but as soon as you reject some samples, you lower the mean of the first value and increase the mean of the second value. The difference is small, but UNAVOIDABLE. If that is OK with you, so be it. Code
import numpy as np
import matplotlib.pyplot as plt
def x1x2x3x4(rng, summa, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(summa, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
rng = np.random.default_rng()
s = 100
N = 5000000
# histograms
first = np.zeros(s+1)
secnd = np.zeros(s+1)
third = np.zeros(s+1)
forth = np.zeros(s+1)
mfirst = np.float64(0.0)
msecnd = np.float64(0.0)
mthird = np.float64(0.0)
mforth = np.float64(0.0)
for _ in range(0, N): # sampling with equal probabilities
    v = x1x2x3x4(rng, s, 0.25)
    q = v[0]
    mfirst += np.float64(q)
    first[q] += 1.0
    q = v[1]
    msecnd += np.float64(q)
    secnd[q] += 1.0
    q = v[2]
    mthird += np.float64(q)
    third[q] += 1.0
    q = v[3]
    mforth += np.float64(q)
    forth[q] += 1.0
x = np.arange(0, s+1, dtype=np.int32)
fig, axs = plt.subplots(4)
axs[0].stem(x, first, markerfmt=' ')
axs[1].stem(x, secnd, markerfmt=' ')
axs[2].stem(x, third, markerfmt=' ')
axs[3].stem(x, forth, markerfmt=' ')
plt.show()
print((mfirst/N, msecnd/N, mthird/N, mforth/N))
prints
(24.9267492, 25.0858356, 24.9928602, 24.994555)
NB! As I said, first mean is lower and second is higher. Histograms are a little bit different as well
UPDATE III
OK, Dirichlet, so be it. Let's compute the mean values of your generator before and after the filter. Code
import numpy as np
def generate(n=10000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)
a = generate(1000000)
print("Original Dirichlet sample means")
print(a.shape)
print(np.mean((a[:, 0] * 100).astype(int)))
print(np.mean((a[:, 1] * 100).astype(int)))
print(np.mean((a[:, 2] * 100).astype(int)))
print("\nFiltered Dirichlet sample means")
q = (a[(a[:,0]<=2*a[:,1]) & (a[:,2]>0.35),:] * 100).astype(int)
print(q.shape)
print(np.mean(q[:, 0]))
print(np.mean(q[:, 1]))
print(np.mean(q[:, 2]))
I've got
Original Dirichlet sample means
(1000000, 3)
32.833758
32.791228
32.88054
Filtered Dirichlet sample means
(281428, 3)
13.912784086871243
28.36360987535
56.23109285501087
Do you see the difference? As soon as you apply any kind of filter, you alter the distribution. Nothing is uniform anymore
OK, so I have this solution for my actual question: I generate 9000 triplets of random variables by prepending zeros and appending ones to arrays of sorted random pairs and then taking the differences, as suggested in the SO answer I mentioned in my original question.
Then I simply filter out the ones that don't match my constraints and plot them.
import numpy as np
import matplotlib.pyplot as plt
S = 100
def generate(n=9000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)
a = generate()
def plotter(a):
    fig = plt.figure(figsize=(10, 10), dpi=100)
    ax = fig.add_subplot(projection='3d')
    surf = ax.scatter(*zip(*a), marker='o', color=a / 100)
    ax.view_init(elev=25., azim=75)
    ax.set_xlabel('$A_1$', fontsize='large', fontweight='bold')
    ax.set_ylabel('$A_2$', fontsize='large', fontweight='bold')
    ax.set_zlabel('$A_3$', fontsize='large', fontweight='bold')
    lim = (0, S)
    ax.set_xlim3d(*lim)
    ax.set_ylim3d(*lim)
    ax.set_zlim3d(*lim)
    plt.show()
# keep only the samples that satisfy the constraints, then scale to sum S
b = a[(a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) &
      (a[:, 1] >= (a[:, 2])),:] * S
plotter(b.astype(int))
As you can see, the distribution is uniformly distributed over these arbitrary limits on the simplex but I'm still not sure if I could forego throwing away samples that don't adhere to the constraints (work the constraints somehow into the generation process? I'm almost certain now that it can't be done for general {Fi}). This could be useful in the general case if your constraints limit your sampled area to a very small subarea of the entire simplex (since resampling like this means that to sample from the constrained area a you need to sample from the simplex an order of 1/a times).
If someone has an answer to this last question I will be much obliged (will change the selected answer to his).
I have an answer to my question. Under a general set of constraints, what I do is:
Sample the constraints in order to estimate s, the constrained area.
If s is big enough, then generate random samples and throw out those that do not comply with the constraints, as described in my previous answer.
Otherwise:
Enumerate the entire simplex.
Apply the constraints to filter out all tuples outside the constrained area.
List the resulting filtered tuples.
When asked to generate, I generate by choosing uniformly from this result list.
(note: this is worth my effort only because I'm asked to generate very often)
A combination of these two strategies should cover most cases.
Note: I also had to handle cases where S was a randomly generated parameter (m < S < M) in which case I simply treat it as another random variable constrained between m and M and I generate it together with the rest of the variables and handle it as I described earlier.
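The enumerate-and-filter branch described above can be sketched as follows. This is only an illustration under assumptions: the function name enumerate_constrained, the lower bound of 1 for every variable, and the small total of 20 are chosen for the example. The cost grows quickly with the total and the number of variables, so it only pays off when the list is built once and sampled from many times.

```python
import random
from itertools import product

def enumerate_constrained(total, n_vars, constraint, low=1):
    """List every tuple of n_vars integers >= low that sums to total
    and satisfies constraint(tuple)."""
    results = []
    upper = total - low * (n_vars - 1)   # largest value a single variable can take
    for head in product(range(low, upper + 1), repeat=n_vars - 1):
        last = total - sum(head)         # the last variable is fixed by the sum
        if last < low:
            continue
        candidate = head + (last,)
        if constraint(candidate):
            results.append(candidate)
    return results

# the simple case from the question (X1 <= 2*X2), with a small total to keep it cheap
valid = enumerate_constrained(20, 4, lambda v: v[0] <= 2 * v[1])

rng = random.Random(0)
sample = rng.choice(valid)   # every admissible tuple is equally likely
print(len(valid), sample)
```

Choosing uniformly from the precomputed list gives each admissible tuple the same probability by construction.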

PyQt-Fit's NonParamRegression vs. R's loess

Are those two functions more or less equivalent? For example, if I have an R call like:
loess(formula = myformula, data = mydata, span = myspan, degree = 2, normalize = TRUE, family = "gaussian")
How can I obtain the same or similar result with PyQt-Fit? Should I simply call the smooth.NonParamRegression function (http://pythonhosted.org/PyQt-Fit/NonParam_tut.html) with method=npr_methods.LocalPolynomialKernel(q=2)? What about other parameters, such as span, and family?
UPDATE
I do realize the two implementations are likely not equivalent (https://www.statsdirect.com/help/nonparametric_methods/loess.htm). But any comments regarding "approximating" their outcomes are appreciated.
Statsmodels has a LOWESS implementation
(http://www.statsmodels.org/devel/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html).
Check out this post on the difference between LOESS and LOWESS: https://stats.stackexchange.com/questions/161069/difference-between-loess-and-lowess
Quick example on how to use statsmodels' lowess function in Python
import numpy as np
import statsmodels.api as sm
lowess = sm.nonparametric.lowess
Generate two random arrays, x and y (lowess expects one-dimensional arrays):
x = np.random.rand(100)
y = np.random.rand(100)
Run the lowess function (frac is the fraction of the data used for each local fit, i.e. the bandwidth. Note that frac and it are set arbitrarily. Also, not all parameters are specified here; some are left at their defaults. For more, see the official documentation):
results = lowess(y, x, frac=0.05, it=3)
The results are stored in a two-dimensional array. The first column contains the sorted x (exog) values and the second column the associated estimated y (endog) values.
If, for instance, you'd like to construct the residuals, keep in mind that the fitted values are ordered by x, so order y the same way first:
res = y[np.argsort(x)] - results[:,1]

Optimized fitting coefficients for better fitting

I'm running a nonlinear least squares using the minpack.lm package.
However, for each group in the data I would like to optimize (minimize) the fitting parameters, similar to what Python's minimize function does.
The minimize() function is a wrapper around Minimizer for running an
optimization problem. It takes an objective function (the function
that calculates the array to be minimized), a Parameters object, and
several optional arguments.
The reason I need this is that I want to optimize the fitting function based on the obtained fitting parameters, in order to find global fitting parameters that can fit both of the groups in the data.
Here is my current approach for fitting in groups,
df <- data.frame(y=c(replicate(2,c(rnorm(10,0.18,0.01), rnorm(10,0.17,0.01))),
c(replicate(2,c(rnorm(10,0.27,0.01), rnorm(10,0.26,0.01))))),
DVD=c(replicate(4,c(rnorm(10,60,2),rnorm(10,80,2)))),
gr = rep(seq(1,2),each=40),logic=rep(c(1,0),each=40))
the fitting equation of these groups is
fitt <- function(data) {
fit <- nlsLM(y~pi*label2*(DVD/2+U1)^2,
data=data,start=c(label2=1,U1=4),trace=T,control = nls.lm.control(maxiter=130))
}
library(minpack.lm)
library(plyr) # will help to fit in groups
fit <- dlply(df, c('gr'), .fun = fitt) #,"Die" only grouped by Waferr
> fit
$`1`
Nonlinear regression model
model: y ~ pi * label2 * (DVD/2 + U1)^2
data: data
label2 U1
2.005e-05 1.630e+03
$`2`
label2 U1
2.654 -35.104
I need to know whether there is any function that optimizes the sum of squares to get the best fit for both of the groups.
One may say that I already have the best fitting parameters in terms of the residual sum of squares, but I know that a minimizer can do this, and I haven't found any similar example of how to do this in R.
P.S. I made up the numbers and fitting lines.
Not sure about R, but least squares with shared parameters is usually simple to implement.
A simple python example looks like:
from random import random
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize
#just for my normally distributed errors
def boxmuller(x0,sigma):
    u1=random()
    u2=random()
    ll=np.sqrt(-2*np.log(u1))
    z0=ll*np.cos(2*np.pi*u2)
    z1=ll*np.sin(2*np.pi*u2) # the second variate uses sin, not cos
    return sigma*z0+x0, sigma*z1+x0
#some non-linear function
def f0(x,a,b,c,s=0.05):
    return a*np.sqrt(x**2+b**2)-np.log(c**2+x)+boxmuller(0,s)[0]
# residual function for least squares takes two data sets.
# not necessarily same length
# two of three parameters are common
def residuals(parameters,l1,l2,dataPoints):
    a,b,c1,c2 = parameters
    set1=dataPoints[:l1]
    set2=dataPoints[-l2:]
    distance1 = [(a*np.sqrt(x**2+b**2)-np.log(c1**2+x))-y for x,y in set1]
    distance2 = [(a*np.sqrt(x**2+b**2)-np.log(c2**2+x))-y for x,y in set2]
    res = distance1+distance2
    return res
xList0=np.linspace(0,8,50)
#some xy data
xList1=np.linspace(0,7,25)
data1=np.array([f0(x,1.2,2.3,.33) for x in xList1])
#more xy data using different third parameter
xList2=np.linspace(0.1,7.5,28)
data2=np.array([f0(x,1.2,2.3,.77) for x in xList2])
# zip returns an iterator in Python 3, so build lists before concatenating
alldata=np.array(list(zip(xList1,data1))+list(zip(xList2,data2)))
# rough estimates
estimate = [1, 1, 1, .1]
#fitting; providing second length is actually redundant
bestFitValues, ier= optimize.leastsq(residuals, estimate,args=(len(data1),len(data2),alldata))
print(bestFitValues)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xList1, data1)
ax.scatter(xList2, data2)
ax.plot(xList0,[f0(x,bestFitValues[0],bestFitValues[1],bestFitValues[2] ,s=0) for x in xList0])
ax.plot(xList0,[f0(x,bestFitValues[0],bestFitValues[1],bestFitValues[3] ,s=0) for x in xList0])
plt.show()
#output
>> [ 1.19841984 2.31591587 0.34936418 0.7998094 ]
If required you can even make your minimization yourself. If your parameter space is sort of well behaved, i.e. approximately parabolic minimum, a simple Nelder Mead method is quite OK.
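As a rough sketch of that last remark, the shared-parameter sum of squares can also be handed directly to a general-purpose minimizer. The example below uses SciPy's Nelder-Mead through scipy.optimize.minimize instead of a hand-written implementation. The data is a made-up, noise-free toy case (two straight lines with a common slope a and separate offsets c1, c2), not the model from the question.

```python
import numpy as np
from scipy import optimize

# two synthetic data sets sharing the slope a but not the offset (illustrative only)
x1 = np.linspace(-1.0, 1.0, 20)
x2 = np.linspace(-1.0, 1.0, 25)
y1 = 2.0 * x1 + 1.0
y2 = 2.0 * x2 + 3.0

def sum_of_squares(parameters):
    """Total squared residual over both data sets; a is shared, c1 and c2 are not."""
    a, c1, c2 = parameters
    r1 = a * x1 + c1 - y1
    r2 = a * x2 + c2 - y2
    return np.sum(r1**2) + np.sum(r2**2)

result = optimize.minimize(sum_of_squares, x0=[1.0, 0.5, 0.5],
                           method='Nelder-Mead',
                           options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 2000})
print(result.x)   # estimates of a, c1, c2
```

Nelder-Mead needs no derivatives, which makes it easy to apply, but it tends to be slower than a dedicated least-squares routine and can stall when the parameter space is not well behaved.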

Math behind Conv2D function in Keras

I am using Conv2D model of Keras 2.0. However, I cannot fully understand what the function is doing mathematically. I try to understand the math using randomly generated data and a very simple network:
import numpy as np
import keras
from keras.layers import Input, Conv2D
from keras.models import Model
from keras import backend as K
# create the model
inputs = Input(shape=(10,10,1)) # 1 channel, 10x10 image
outputs = Conv2D(32, (3, 3), activation='relu', name='block1_conv1')(inputs)
model = Model(outputs=outputs, inputs=inputs)
# input
x = np.random.random(100).reshape((10,10))
# predicted output for x
y_pred = model.predict(x.reshape((1,10,10,1))) # y_pred.shape = (1,8,8,32)
I tried to calculate, for example, the value of the first row, the first column in the first feature map, following the demo in here.
w = model.layers[1].get_weights()[0] # w.shape = (3,3,1,32)
w0 = w[:,:,0,0]
b = model.layers[1].get_weights()[1] # b.shape = (32,)
b0 = b[0] # b0 = 0
y_pred_000 = np.sum(x[0:3,0:3] * w0) + b0
But relu(y_pred_000) is not equal to y_pred[0][0][0][0].
Could anyone point out what's wrong with my understanding? Thank you.
It's easy and it comes from the Theano dim ordering. The result of applying a filter is stored in a so-called channel dimension. In the case of TensorFlow this is the last dimension, which is why the results are good there. In the case of Theano it's the second dimension (the convolution result has shape (cases, channels, width, height)), so in order to solve your problem you need to change the prediction line to:
y_pred = model.predict(x.reshape((1,1,10,10)))
Also, you need to change the way you get the weights: since weights in Theano have shape (output_channels, input_channels, width, height), you need to change the weight getter to:
w = model.layers[1].get_weights()[0] # w.shape = (32,1,3,3)
w0 = w[0,0,:,:]
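For reference, the arithmetic the question is trying to reproduce can be sketched in plain NumPy, independent of Keras. The sketch assumes the channels-last layout (image of shape (height, width, channels), kernel of shape (kh, kw, in_channels, filters)), no padding, stride 1, and no kernel flipping. The function name conv2d_valid_relu and the random weights are made up for illustration.

```python
import numpy as np

def conv2d_valid_relu(image, kernel, bias):
    """Slide each filter over the image without padding, add the bias, apply ReLU."""
    h, w, _ = image.shape
    kh, kw, _, n_filters = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]   # shape (kh, kw, in_channels)
            # multiply the patch with every filter and sum over kh, kw, in_channels
            out[i, j, :] = np.tensordot(patch, kernel, axes=3) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.random((10, 10, 1))              # one 10x10 image with a single channel
w = rng.standard_normal((3, 3, 1, 32))   # 32 filters of size 3x3
b = np.zeros(32)

y = conv2d_valid_relu(x, w, b)
print(y.shape)

# the single output value that the question computes by hand
y_000 = max(np.sum(x[0:3, 0:3, 0] * w[:, :, 0, 0]) + b[0], 0.0)
print(np.isclose(y[0, 0, 0], y_000))
```

With a different dimension ordering, the same sums apply, but the axes of the input and of the weight array have to be indexed accordingly, which is the point of the answer above.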
