How to apply principal component analysis to standardised multicentric data?

I have a question about principal component analysis.
I am working with a dataset containing two cohorts from two different centres. From each centre I have a control group and two patient subgroups (drug-resistant and drug-responsive). My objective is to analyse the neurocognitive data that all subjects provided during the study. The problem is that the cognitive tests applied differ slightly across centres. I therefore standardised the raw scores in each patient subgroup relative to the control group of their respective centre. Still, I am left with a big dataset of z-scores and would like to further reduce dimensionality with PCA.
My question is: does it make sense to apply PCA after standardising the data this way? (I am not sure I can call them z-scores, as I standardised them relative to the mean and standard deviation of the control group of the respective centre and not of the entire sample!) The means of the columns will therefore not be 0. Would it still be legitimate to apply PCA? And do you think I should scale the variables again?
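For concreteness, the standardisation I describe looks roughly like this (the data frame, column names, and group labels below are just placeholders):

# `dat` is assumed to have columns centre, group, and one column per cognitive test.
std_by_centre <- function(dat, test_cols) {
  for (ctr in unique(dat$centre)) {
    ctrl <- dat$centre == ctr & dat$group == "control"   # controls of this centre
    for (v in test_cols) {
      mu  <- mean(dat[ctrl, v], na.rm = TRUE)
      sdv <- sd(dat[ctrl, v], na.rm = TRUE)
      dat[dat$centre == ctr, v] <- (dat[dat$centre == ctr, v] - mu) / sdv
    }
  }
  dat
}
# The PCA step would then be something like:
# z <- std_by_centre(dat, test_cols)
# p <- prcomp(z[, test_cols])   # prcomp() re-centres each column internally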
Any suggestions or comments are much appreciated!
Best wishes,
Bernardo

Related

When do the principal components of PCA form a basis for the dataset?

Suppose I do a PCA on a dataset and get k principal components that explain 100% of the total variance of the dataset.
We can say any observation from the dataset can be reconstructed as the mean plus a linear combination of all the principal components.
When I do this analysis on different datasets, I notice that sometimes the observations can be reconstructed as a linear combination of the principal components alone (v1, v2, ..., vk), without using the mean, even though the observations do not have a zero mean.
This means the mean of the observations can itself be reconstructed as a linear combination of the principal components (v1, v2, ..., vk). It seems this happens when the observations are actually linear combinations of a different set of vectors (x1, x2, ..., xk).
For a dataset containing vectors belonging to a vector space spanned by some vectors x1, x2, ..., xk, if I do PCA on this dataset, will the principal components always be able to reconstruct the observations alone, without adding the mean? It feels like the principal components find an orthogonal basis for this dataset, but why?
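To make this concrete, here is a small synthetic example in base R of the behaviour I am describing: the observations lie in a subspace through the origin, and projecting the raw observations onto the principal components alone reproduces them without adding the mean.

set.seed(1)
# Every observation is a linear combination of x1 and x2, so the data lie in a
# 2-dimensional subspace of R^5 that passes through the origin.
x1 <- c(1, 0, 2, 0, 1)
x2 <- c(0, 1, 0, 3, 1)
coefs <- matrix(rnorm(200, mean = 2), ncol = 2)   # coefficients with a non-zero mean
X <- coefs %*% rbind(x1, x2)                      # 100 observations in span(x1, x2)

p <- prcomp(X)                                    # prcomp() centres the data
k <- sum(p$sdev > 1e-8)                           # number of non-trivial components (2 here)
V <- p$rotation[, 1:k, drop = FALSE]

# Project the raw (uncentred) observations onto the span of the components:
proj <- X %*% V %*% t(V)
max(abs(proj - X))                                # essentially zero: no mean needed

Here the centred data still span the whole 2-dimensional subspace, so the mean itself lies in the span of the retained components; that is exactly the situation in which the mean can be dropped from the reconstruction.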

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is an R package for this, pps, and the documentation is here.
Also, there is another package called survey, with a bit of documentation here.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly, you don't even need to normalise the weights by 1/sum(sizes), although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the two datasets ds1 and ds2. Show us what code you've tried if it's causing problems. I recommend you use either dplyr or data.table.
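For example, a minimal base-R sketch (the column names state and population are assumptions; adjust them to your data):

# Join the two datasets on the state name, then draw a size-weighted sample.
ds <- merge(ds1, ds2, by = "state")

n <- 10                                    # desired sample size
# sample() accepts unnormalised weights via `prob`; with replace = FALSE the
# draws are sequential, so inclusion probabilities are only approximately
# proportional to size, which is usually fine for a quick PPS-style sample.
idx <- sample(nrow(ds), size = n, prob = ds$population)
pps_sample <- ds[idx, ]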
Your second question should be asked separately, and it is off-topic on SO, or at least won't get a great response there; statistical questions are best asked on the sister site Cross Validated.

Which cluster methodology should I use for a multidimensional dataset?

I am trying to create clusters of countries from a quite heterogeneous dataset (the data I have on countries ranges from median age to disposable income, including education levels).
How should I approach this problem?
I read some interesting papers on clustering, using k-means for instance, but it seems those algorithms are mostly used when there are two variables, not 30 like in my case, and when the variables are comparable (it might be tough to try to cluster countries with such diversity in the data).
Should I normalise some of the data? Should I just focus on fewer indicators to avoid this multidimensional issue? Use spectral clustering first?
Thanks a lot for the support!
Create a "similarity metric". Probably just a weighted combination of all your measurements, though you might build in some corrections for population size and so on. There are only low hundreds of countries, so most brute-force methods will work. Hierarchical clustering would be my first port of call, and it will tell you whether the data are inherently clustered.
If all the data are quantitative, you can normalise each variable to 0-1 (lowest country is 0, highest is 1) and then take eigenvectors. Then plot the first two axes in eigenspace; that gives another visual fix on the clusters.
If the data aren't clustered, however, it's better to admit that.
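A minimal base-R sketch of that recipe, assuming countries is a data frame of numeric indicators with country names as row names:

# Normalise each indicator to the 0-1 range (lowest country = 0, highest = 1).
rng01 <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
X <- as.data.frame(lapply(countries, rng01))
rownames(X) <- rownames(countries)

# Hierarchical clustering: the dendrogram shows whether the data cluster at all.
hc <- hclust(dist(X), method = "ward.D2")
plot(hc)
groups <- cutree(hc, k = 4)                 # cut into, say, 4 groups

# Eigen view: plot the countries on the first two principal axes as a visual check.
p <- prcomp(X)
plot(p$x[, 1:2], col = groups)
text(p$x[, 1:2], labels = rownames(X), cex = 0.6)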

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average to remove some noise, but then what next? Simple things like the standard deviation, MAD, etc. against the whole dataset don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise-cancelling headphones. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short-term model of the signal source, thereby reducing the mean square error of the output. This gives you a low-level output signal (the residual error) except when you get an outlier, which will show up as a spike that is easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
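If you want to experiment, here is a rough LMS sketch in R (not a production detector; the filter order and the step size mu are arbitrary choices that need tuning for your signal scale):

# Predict each sample from the previous `order` samples with an LMS-adapted filter;
# the residual e[t] should be small except when a sample is unusual.
lms_residuals <- function(x, order = 5, mu = 0.01) {
  w <- rep(0, order)                     # filter coefficients, adapted online
  e <- rep(NA_real_, length(x))
  for (t in (order + 1):length(x)) {
    u <- x[(t - 1):(t - order)]          # the `order` most recent samples
    pred <- sum(w * u)                   # filter prediction of x[t]
    e[t] <- x[t] - pred                  # residual error: spikes suggest outliers
    w <- w + mu * e[t] * u               # LMS coefficient update
  }
  e
}
# Example: flag samples whose residual exceeds, say, 4 robust SDs.
# res <- lms_residuals(series)
# which(abs(res) > 4 * mad(res, na.rm = TRUE))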
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample whose "outlierness" you want to know:
Retrieve the means, correlation matrix, and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
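For illustration, a minimal base-R sketch of this scheme, assuming train is a numeric matrix with one row per historical sample and one column per attribute (the function names are just placeholders):

fit_outlier_model <- function(train, sd_cut = 3) {
  # Step 1: drop obvious outliers (any attribute more than sd_cut SDs from its mean).
  z     <- scale(train)
  keep  <- apply(abs(z) <= sd_cut, 1, all)
  clean <- train[keep, , drop = FALSE]

  # Step 2: store the means, per-attribute SDs, and correlation matrix
  # (working on standardised attributes is what "use the correlation matrix" amounts to).
  centre <- colMeans(clean)
  sds    <- apply(clean, 2, sd)
  S      <- cor(clean)

  # Step 3: (squared) Mahalanobis distances of the training samples.
  zc      <- scale(clean, center = centre, scale = sds)
  d_train <- mahalanobis(zc, center = rep(0, ncol(clean)), cov = S)

  list(centre = centre, sds = sds, S = S, d_train = d_train)
}

outlier_score <- function(model, x) {
  # (Squared) Mahalanobis distance of the new sample, then its percentile among
  # the training distances: a value close to 1 means an extreme outlier.
  z <- (x - model$centre) / model$sds
  d <- mahalanobis(z, center = rep(0, length(z)), cov = model$S)
  mean(model$d_train <= d)
}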

How do you tell if it's better to standardize your data matrix first when you do principal component analysis in R?

I'm trying to do principal component analysis in R. There are two ways of doing it, I believe.
One is doing principal component analysis on the matrix right away; the other is standardizing the matrix first using s = scale(m) and then applying principal component analysis.
How do I tell which result is better? What values, in particular, should I look at? I have already managed to find the eigenvalues and eigenvectors, and the proportion of variance for each eigenvector, using both methods.
I noticed that the proportion of variance for the first principal component without standardizing had a larger value. Is there a meaning to this? Isn't this always the case?
Lastly, if I am supposed to predict a variable (e.g. weight), should I drop that variable from my data matrix when I do the principal component analysis?
Are your variables measured on a common scale? If yes, then don't scale. If no, then it's probably a good idea to scale.
If you are trying to predict the value of another variable, PCA is probably not the correct tool. Maybe you should look at a regression model instead.
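For reference, a quick sketch of the two options in R, assuming m is your numeric data matrix (prcomp(m, scale. = TRUE) is equivalent to running prcomp(scale(m))):

p_raw    <- prcomp(m)                  # PCA of the covariance matrix (unscaled variables)
p_scaled <- prcomp(m, scale. = TRUE)   # PCA of the correlation matrix (scaled variables)

# Proportion of variance explained by each component under the two options;
# with unscaled data, a variable with large units can dominate the first component.
summary(p_raw)$importance["Proportion of Variance", ]
summary(p_scaled)$importance["Proportion of Variance", ]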
