Correlation coefficient from sets of x,y coordinates - cross-correlation

I have 2 sets of x,y coordinates and I am trying to determine their correlation coefficient. One way to do that is to calculate separately the correlation coefficient for the x-values and the y-values. Is it fair to then take the arithmetic mean of these individual correlation coefficients to estimate the correlation coefficient of the x,y pairs?

Taking the arithmetic mean of the individual correlation coefficients to estimate the correlation coefficient of the x,y pairs is not correct.
Instead, you can use cross-correlation of complex (two-dimensional) signals (arrays), treating each x,y pair as one complex value.
For cross-correlation of two complex arrays a and b, you need to:
Take the FFT of a and then its complex conjugate. (a1)
Take the FFT of b. (b1)
Multiply a1 and b1 element-wise to get an array c.
Take the inverse FFT of c to get d.
To find the maximum of the correlation function, look for the maximum of the moduli (lengths) of the complex values in d; a sketch in R follows the link below.
https://dsp.stackexchange.com/questions/736/how-do-i-implement-cross-correlation-to-prove-two-audio-files-are-similar
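Here is a minimal R sketch of those steps, assuming two equal-length coordinate sets (x1, y1) and (x2, y2) and no zero-padding; the variable names are illustrative, not from the original question.
a <- complex(real = x1, imaginary = y1)   # first set of x,y pairs as complex values
b <- complex(real = x2, imaginary = y2)   # second set
a1 <- Conj(fft(a))                        # FFT of a, then complex conjugate
b1 <- fft(b)                              # FFT of b
c <- a1 * b1                              # element-wise product
d <- fft(c, inverse = TRUE) / length(c)   # inverse FFT (R's fft does not normalize)
which.max(Mod(d))                         # position of the largest correlation magnitude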

Related

PCA in R - trying to compute normalized scores instead of normalized loadings

I am trying to calculate normalized scores of my PCs instead of normalized loadings. I have tried to use the princomp and prcomp functions. All my data are scaled and centered.
In EViews this takes just one click; by default:
"the loadings are normalized so the observation scores have norms proportional to the eigenvalues (Normalize loadings). You may instead choose to normalize the scores instead of the loadings (Normalize scores) so that the observation score norms are proportional to unity."
On another forum I found this:
"Normalize loadings - means that your scores will have variances equal to the estimated eigenvalues
Normalize scores - means that your scores will have unit variances
The latter corresponds to whether you want the eigenvalue decomposition computed using the covariance or the correlation matrix of the data."
I tried that, and the result was still not similar to the EViews normalized scores.
Thanks in advance
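As a hedged sketch of the two normalizations in R (not a verified match to the EViews output): with prcomp, the score columns in pc$x have variances equal to the eigenvalues, so dividing each column by the corresponding entry of pc$sdev gives unit-variance scores. mydata below is a hypothetical data matrix.
pc <- prcomp(mydata, center = TRUE, scale. = TRUE)      # PCA on scaled, centered data
scores_eigenvalue_norm <- pc$x                          # column variances equal the eigenvalues
scores_unit_variance <- sweep(pc$x, 2, pc$sdev, "/")    # columns rescaled to unit variance
apply(scores_unit_variance, 2, var)                     # each column variance is ~1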

Why is the calculation of eigenvectors and eigenvalues so effective when performing PCA?

The core of Principal Component Analysis (PCA) lies in calculating eigenvalues and eigenvectors of the variance-covariance matrix of some dataset (for example, a matrix of multivariate data coming from a set of individuals). The textbook knowledge I have is that:
a) by multiplying such eigenvectors with the original data matrix one can calculate "scores" (as many as the original set of variables) which are independent from each other;
b) the eigenvalues summarize the amount of variance of each score.
These two properties make this process a very effective data transformation technique for simplifying the analysis of multivariate data.
My question is: why is that so? Why does calculating eigenvalues and eigenvectors of a variance-covariance matrix result in such unique properties of the scores?
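A small numeric illustration of properties a) and b), using simulated (hypothetical) data: for centered data X with covariance eigendecomposition V Λ V', the scores XV have covariance V' cov(X) V = Λ, which is diagonal, so the scores are uncorrelated and their variances equal the eigenvalues.
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3) %*% matrix(runif(9), 3, 3)  # correlated multivariate data
X <- scale(X, center = TRUE, scale = FALSE)                       # center the columns
eig <- eigen(cov(X))                    # eigenvectors/eigenvalues of the covariance matrix
scores <- X %*% eig$vectors             # property a): the scores
round(cov(scores), 10)                  # off-diagonals ~0, so the scores are uncorrelated
diag(cov(scores))                       # property b): the score variances ...
eig$values                              # ... equal the eigenvalues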

How to generate a zero-mean Gaussian random vector whose correlation matrix has exponentially distributed eigenvalues

1. This is a question from the paper "Fast Generalized Eigenvector Tracking Based on the Power Method".
2. The author wrote "We generate two zero-mean Gaussian random vectors, which have correlation matrices A and B whose eigenvalues are exponentially distributed".
3. But how to generate a zero-mean Gaussian random vector whose correlation matrix has exponentially distributed eigenvalues has confused me for almost a week.
4. It seems that we can only use randn in MATLAB to generate a random vector, so the problem is how to make sure the correlation matrices have exponentially distributed eigenvalues at the same time.
Let S be a positive definite matrix. Therefore S has a Cholesky decomposition L.L' = S where L is a lower-triangular matrix and ' denotes the matrix transpose and . denotes matrix multiplication. Let x be drawn from a Gaussian distribution with mean zero and covariance equal to the identity matrix. Then y = L.x has a Gaussian distribution with mean zero and covariance S.
So if you can find suitable covariance matrices A and B, you can use their Cholesky decompositions to generate samples. Now about constructing a matrix which has eigenvalues following a given distribution. My advice is to start with a list of samples from an exponential distribution; these will be your eigenvalues. Let E = a matrix with the exponential samples on the diagonal and zeros otherwise. Let U be any unitary matrix (i.e. columns are orthogonal and norm of each column is 1). Then U.E.U' is a positive definite matrix with the specified eigenvalues.
U can be any unitary matrix. In particular U can be the identity matrix. That might make everything else simpler; you'll have to verify whether U = identity is workable for the problem you're working on.
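A minimal R sketch of this recipe (the question mentions MATLAB's randn, but the steps translate directly); the dimension and the random orthogonal U below are illustrative choices:
n <- 5                                      # dimension of the random vector
lambda <- rexp(n)                           # eigenvalues drawn from an exponential distribution
U <- qr.Q(qr(matrix(rnorm(n * n), n, n)))   # a random orthogonal (unitary) matrix
A <- U %*% diag(lambda) %*% t(U)            # positive definite with eigenvalues lambda
L <- t(chol(A))                             # lower-triangular Cholesky factor, A = L L'
x <- rnorm(n)                               # standard Gaussian vector (mean 0, identity covariance)
y <- L %*% x                                # zero-mean Gaussian vector with covariance A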

Inverse Moments of a Non-Central Chi-Square Distribution

I want to compute inverse moments and truncated inverse moments of a non-central chi-square distribution in R. How can I do that?
Suppose X follows the non-central chi-square distribution with degrees of freedom "k" and non-centrality parameter "t". My problem is to numerically compute the following expectations for various values of "t" so I can simulate the risk of James-Stein type estimators.
(i) E[X^(-1)] and E[X^(-2)]
(ii) E[X^(-1)I(A)] where I(A) is an indicator function of set A
(iii) E[1-c{X^(-2)}I(A)] where c is a constant.
In general, you can numerically compute the expected value of a random variable by drawing a large number of samples and then averaging them. For instance, you could estimate the expected values of X^(-1) and X^(-2) with something like:
mean(rchisq(1000000, df=3, ncp=10)^-1)
# [1] 0.1152163
mean(rchisq(1000000, df=3, ncp=10)^-2)
# [1] 0.1371877
Paolella's book, Intermediate Probability, gives the moments of the non-central chi-square to various powers; see equation (10.10). You can find R code for these in the sadists package.
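The same Monte Carlo idea extends to the truncated expectations in (ii) and (iii). Here is a hedged sketch in which the set A and the constant c are illustrative choices (A = {X > a} with a = 5, and c = 2), not values from the original question:
x <- rchisq(1000000, df = 3, ncp = 10)   # samples from the non-central chi-square
a <- 5                                   # hypothetical threshold defining A = {X > a}
mean(x^-1 * (x > a))                     # estimates E[X^(-1) I(A)]
cc <- 2                                  # hypothetical constant c
mean(1 - cc * x^-2 * (x > a))            # estimates E[1 - c X^(-2) I(A)]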

How to set up a weighted least-squares regression in R for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data are aggregated by city, so I have many thousands of observations.
My model is somewhat heteroscedastic, though. I want to run a weighted least-squares regression where each observation is weighted by the city's population. In this case, that would mean weighting the observations by the inverse of the square root of the population. It's unclear to me, however, what the best syntax would be. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population)), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population), then you want ...weights=1/population.
As to which of those is most appropriate... that's a question for CrossValidated!
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide each residual by the square root of its corresponding population size:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance among the residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2, are similar, then you're doing something right. If they are drastically different, then your assumption may be violated.
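Putting the fit and the check together as a single hedged sketch; df, life_exp, and population are hypothetical names for the census data frame and its columns:
fit <- lm(life_exp ~ ., data = df, weights = 1/population)        # WLS fit weighted by 1/population
standardized_residuals <- residuals(fit) / sqrt(df$population)    # rescale the residuals
variance1 <- var(standardized_residuals[df$population < median(df$population)])
variance2 <- var(standardized_residuals[df$population > median(df$population)])
c(variance1, variance2)   # roughly equal if the error variance is proportional to population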
