Can PCA or Principal component regression reveal information not seen in the univariate case? - r

I am wondering if there is a case where you see something in the principal components (PCs) that you do not see by looking univariately at the variables that the PCA is based on. For instance, considering the case of group differences: you see a separation of two groups in one of the PCs, but not in any single variable (univariate).
I will use an example in the two-dimensional setting to better illustrate my question: Let's suppose we have two groups, A and B, and for each observation we have two multivariate-normal distributed covariables.
# First Setting:
library(MASS)  # provides mvrnorm()
group_A <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,3,3,2),2,2))
group_B <- mvrnorm(n=1000, mu=c(10,3), Sigma=matrix(c(10,3,3,2),2,2))
dat <- rbind(cbind.data.frame(group_A, group="A"),cbind.data.frame(group_B, group="B"))
# convert the labels to integer colour codes; a character column is not a valid col
plot(dat[,1:2], xlab="x", ylab="y", col=as.integer(factor(dat[,"group"])))
In this first setting you see a group separation in the variable x, in the variable y, and you will also see a separation in both principal components. Hence, using the PCA we get the same result we got in the univariate case: the groups A and B have different values in the variables x and y.
In a second example generated by myself, you do not see a separation in variable x, variable y, or in PC1 or PC2. Hence, although our common sense suggests that we can distinguish between the two groups based on x and y, we do not observe this in the univariate case and the PCA doesn't help us either:
# Second setting
group_A <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,3,3,2),2,2))
group_B <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,-3,-3,2),2,2))
dat <- rbind(cbind.data.frame(group_A, group="A"),cbind.data.frame(group_B, group="B"))
plot(dat[,1:2], xlab="x", ylab="y", col=as.integer(factor(dat[,"group"])))
QUESTION: Is there a case where the PCA helps us extract correlations or separations we would not see in the univariate case? Can you construct one, or is this not possible in the two-dimensional case?
Thank you all in advance for helping me to disentangle this.

I think your question is mainly the result of a misunderstanding of what PCA does. It doesn't find clusters of data like, say, kmeans or DBSCAN. It projects n-dimensional data onto an orthogonal basis. Then it selects the top k dimensions (according to variance explained), where k < n.
So in your example, PCA doesn't know that group A was generated by some distribution and group B by another. It just sees the data in 2 dimensions and finds two principal components (from which you may or may not select 1). You might as well plot all 2000 data points in the same color.
However, if you wanted to use PCA in this instance, you would indicate that a 3rd dimension distinguishes between group A and group B. You could, for example, label group A +1 and group B -1 (or something that makes sense relative to the scale of the other dimensions). Then perform PCA on 3 dimensions, reducing to 2 or 1, depending on what the eigenvalues tell you about the variation explained.
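For what it's worth, a case of the kind the question asks about can be constructed in two dimensions. The sketch below is only an illustration under assumed parameters: the covariance matrix, the group means and the helper `effect_size` are made up for this example and do not come from the question. Both groups share a strong positive correlation, and the group means differ along the direction of least variance. PCA still knows nothing about the labels; it only rotates the data, and here the rotation happens to line PC2 up with the direction that separates the groups.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)
# Assumed parameters: strongly correlated x and y, identical in both groups
Sigma <- matrix(c(10, 9.5, 9.5, 10), 2, 2)
group_A <- mvrnorm(n = 1000, mu = c(0, 0),  Sigma = Sigma)
group_B <- mvrnorm(n = 1000, mu = c(2, -2), Sigma = Sigma)
X <- rbind(group_A, group_B)
colnames(X) <- c("x", "y")
group <- factor(rep(c("A", "B"), each = 1000))

# PCA on the pooled data; the group labels are not used
pca <- prcomp(X)

# Hypothetical helper: absolute mean difference in pooled-SD units
effect_size <- function(v, g) {
  m <- tapply(v, g, mean)
  s <- tapply(v, g, sd)
  unname(abs(m[1] - m[2]) / sqrt(mean(s^2)))
}

d_x   <- effect_size(X[, "x"], group)
d_y   <- effect_size(X[, "y"], group)
d_pc1 <- effect_size(pca$x[, 1], group)
d_pc2 <- effect_size(pca$x[, 2], group)
round(c(x = d_x, y = d_y, PC1 = d_pc1, PC2 = d_pc2), 2)

# Side-by-side boxplots: x by group versus the PC2 scores by group
par(mfrow = c(1, 2))
boxplot(X[, "x"] ~ group, main = "x")
boxplot(pca$x[, 2] ~ group, main = "PC2")
```

With these assumed numbers the groups overlap heavily in x and in y, while the scores on PC2 are far apart relative to their spread. The same construction fails if the mean difference points along PC1 or if the two groups only differ in their covariance, as in the second setting of the question.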

Related

Clustering with R 'Mclust' function: setting priors for output parameters

I am using the R package mclust to separate data into clusters. For this, I am using a one-dimensional model that allows for variable variances of the normal distributions underlying the clustering (the "V" model in the package).
The function looks like this: Mclust(dataToCluster, G=possibleClusters, modelNames=c("V")). To define the number of clusters possible, I use an array possibleClusters, e.g. 1:4 to allow for one to four clusters.
As a result of the clustering, after automatic model selection by Mclust using the BIC, I get a result with parameters of a normal distribution. For a model with three clusters, it might look like this:
# output shortened and commented for better readability
> result$parameters
# proportion of data points per cluster ("lambda")
$pro
[1] 0.3459566 0.3877521 0.2662913
# mean of normal distribution per cluster ("mu")
$mean
       1        2        3
110.3197 204.0477 265.0929
# variances per cluster ("sigma sq")
$variance$sigmasq
[1] 342.5032 128.4648 254.9257
However, I do have some knowledge about what these parameters are supposed to look like a priori. For example, I might know that:
1. sigmasq must be between 100 and 1000 units
2. the mean values of adjacent clusters must be at least 40 units apart
3. if there are three clusters, the mean value of the third cluster must be at least 215 units
Here is a graphical example for possible results of the clustering (the x axis corresponds to the units of mean and sigma unsquared):
Taking into account the constraints given above, example plots A1 (according to rules 1 and 2) and B1 (according to rules 2 and 3) can't be correct. Instead, the results should look more like A2 and B2, which were produced using slightly different data. Note that, taking into account these constraints, the “best” number of clusters might change (A1 vs. A2).
I would like to know how to include this kind of a priori information when using the Mclust function. The function does have a parameter prior, which might allow for this but I wasn't able to figure out how this could work. How could I bring the constraints into the function?
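Short of building the constraints into the fit itself, the three rules can at least be checked against a fitted result. The sketch below does only that: it does not use the prior argument and does not change how Mclust estimates anything. The helper `meets_constraints` is a hypothetical name, and the `params` list simply re-types the shortened output shown above in the same layout as `result$parameters`.

```r
# Parameters copied from the shortened output above, in the same list layout
params <- list(
  pro      = c(0.3459566, 0.3877521, 0.2662913),
  mean     = c(`1` = 110.3197, `2` = 204.0477, `3` = 265.0929),
  variance = list(sigmasq = c(342.5032, 128.4648, 254.9257))
)

# Hypothetical helper: one TRUE/FALSE per rule
meets_constraints <- function(p) {
  mu <- sort(unname(p$mean))
  s2 <- p$variance$sigmasq
  c(
    variance_in_range = all(s2 >= 100 & s2 <= 1000),   # rule 1
    means_40_apart    = all(diff(mu) >= 40),           # rule 2
    third_mean_215    = length(mu) != 3 || mu[3] >= 215  # rule 3
  )
}

meets_constraints(params)

# A made-up variant that breaks rule 1 only
bad <- params
bad$variance$sigmasq[2] <- 50
meets_constraints(bad)
```

One possible use, offered as an assumption rather than a feature of the package, is to fit each candidate number of clusters separately, discard the fits that fail the check, and compare BIC only among the remaining ones.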

Preferentially Sampling Based upon Value Size

So, this is something I think I'm complicating far too much but it also has some of my other colleagues stumped as well.
I've got a set of areas represented by polygons and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the (x - min(x)) / (max(x) - min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest areas are almost always the ones sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure on how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n <- 500
df <- data.frame("ID" = 1:n,"AREA" = replicate(n,sum(rexp(n=8,rate=0.1))))
# Generate the sampling probability somehow based upon the AREA values, with smaller areas having a higher sampling probability:
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size=1, prob=prob_vector)
There is no one best solution for this question as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right skewed distribution of areas (0-100 units) and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000,1,3)*40
area.dist[area.dist>100]=100
hist(area.dist,main="Probability functions")
area = seq(0,100,0.1)
prob_vector1 = 1-(area-min(area))/(max(area)-min(area)) ## linear
prob_vector2 = .8-(.6*(area-min(area))/(max(area)-min(area))) ## low slope
prob_vector3 = 1/(1+((area-min(area))/(max(area)-min(area))))**4 ## strong curve
prob_vector4 = .4/(.4+((area-min(area))/(max(area)-min(area)))) ## low curve
legend("topright",c("linear","low slope","strong curve","low curve"), col = c("red","green","blue","orange"),lwd=1)
lines(area,prob_vector1*500,col="red")
lines(area,prob_vector2*500,col="green")
lines(area,prob_vector3*500,col="blue")
lines(area,prob_vector4*500,col="orange")
The output is:
The red line is your solution, the other ones are adjustments to make it weaker. Just change numbers in the probability function until you get one that fits your expectations.
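To tie this back to the data frame in the question, one of the curves can be applied directly to the AREA column. The sketch below uses the "low curve" form with the same constant 0.4 as above; that choice is arbitrary and only serves as an illustration, and the seed is added here for reproducibility.

```r
set.seed(42)
n <- 500
df <- data.frame(ID = 1:n, AREA = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Rescale the areas to [0, 1]
scaled <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))

# "Low curve" weights: 1 for the smallest area, 0.4 / 1.4 for the largest
prob_vector <- 0.4 / (0.4 + scaled)

# sample() rescales the weights internally, so they need not sum to one
s <- sample(df$ID, size = 1, prob = prob_vector)

# Quick look at how the weights fall off with area
plot(df$AREA, prob_vector, xlab = "AREA", ylab = "sampling weight")
```

Because the largest area keeps a weight of about 0.29 instead of 0, large polygons are still drawn from time to time, which is the flatter behaviour asked for. Raising or lowering the constant makes the fall-off weaker or stronger.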

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1] cond1 repA cond1 repB cond2 repA cond2 repB ...
[2] 4.158660e+06 4454400.703 ...
[3] 1.458585e+06 4454400.703 ...
[4] NA 887776.392 ...
...
[5024] 9571785.382 9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1, 13, 2)) {
  # keep only the rows where both replicates were measured
  vec <- na.omit(cbind(df[, i], df[, i + 1]))
  x <- log10(vec[, 1])
  y <- log10(vec[, 2])
  plot(x, y, xlab = 'rep A', ylab = 'rep B', col = "#00000033")
  fit <- lm(y ~ x)
  abline(fit, col = 'red')
  # store the adjusted R^2 for this condition, then show it on the plot
  rdata[1, (i + 1) / 2] <- format(summary(fit)$adj.r.squared, digits = 4)
  legend("topleft", bty = "n", legend = paste("R2 is", rdata[1, (i + 1) / 2]))
}
However, the lm seems to be shifted so that it does not fit the trend I see in the experimental data:
It consistently occurs for every condition. I unsuccessfully tried to find an explanation by looking up the source code and browsing different forums and posts (this or here).
Would have liked to simply comment/ask a few questions, but can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, however I'm not skilled enough to discuss that). Your question seems linked to this one.
What you can use is a total least squares procedure: it takes into account the error in X and Y. In the example below, I've used a "normal" TLS assuming there is the same error in X and Y (thus error.ratio=1). If that is not the case, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check the documentation to make sure).
library(mcr)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
Please note that, in Deming/TLS regressions, your x- and y-errors are supposed to follow a normal distribution, as explained here. If that's not the case, go for a Passing-Bablok regression (and the R code is here).
Also note that the R2 isn't defined for Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you're studying a linear correlation between two factors, see Pearson's product moment correlation coefficient, and use e.g. the rcorr function.
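To make the difference between the two fits concrete, here is a small base-R sketch on simulated replicates. Everything in it is assumed for illustration: the simulated values, the error sizes and the hand-written helper `deming_fit`, which is not the mcr implementation. Its `delta` argument is defined here as the error variance of y divided by the error variance of x, which may not match the convention of `error.ratio` in mcr.

```r
set.seed(123)
n <- 5000
truth <- rnorm(n, mean = 6, sd = 0.5)   # unobserved true log10 signal
x <- truth + rnorm(n, sd = 0.3)         # replicate A, measured with error
y <- truth + rnorm(n, sd = 0.3)         # replicate B, same error size

# Ordinary least squares: treats x as error-free
ols_slope <- unname(coef(lm(y ~ x))[2])

# Hand-written Deming estimator (hypothetical helper).
# delta = (error variance of y) / (error variance of x); 1 = orthogonal fit
deming_fit <- function(x, y, delta = 1) {
  sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
  slope <- (syy - delta * sxx +
            sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) / (2 * sxy)
  c(intercept = mean(y) - slope * mean(x), slope = slope)
}

dem <- deming_fit(x, y)
round(c(ols = ols_slope, deming = dem[["slope"]]), 3)

plot(x, y, col = "#00000033", xlab = "rep A", ylab = "rep B")
abline(lm(y ~ x), col = "red")                           # flattened OLS line
abline(dem[["intercept"]], dem[["slope"]], col = "blue") # Deming line
```

In this simulation the replicates agree one-to-one by construction, yet the OLS slope comes out clearly below 1 because the noise in x pulls it towards zero. The Deming line follows the long axis of the point cloud, which is probably the trend the eye picks out in the scatterplots.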

How to generate OUTLIER-FREE data in R?

I would like to know how I can generate OUTLIER-FREE data using R.
I'm generating data using rnorm.
Say I have a linear equation
Y = B0 + B1*X + E, where X~N(5,9) and E~N(0,1).
I'm going to use rnorm to generate X and E.
Below are the codes used:
X <- rnorm(50,5,3) #I'm generating 50 Xi's w/ mean=5 & var=9
E <- rnorm(50,0,1) #I'm generating 50 residuals w/ mean=0 & var=1
Now, I'm going to generate Y by plugging the generated data on X & E above in the linear equation.
If the data I've generated above is outlier-free (no influential observation), then no Cook's Distance of observations should exceed 4/n, which is the usual cut-off for detecting influential/outlying observations.
But I wasn't able to get this so far. I'm still getting outliers once I generate data following this procedure.
Can you help me out on this? Do you know a way to generate data which is OUTLIER-FREE?
Thanks a lot!
Well, one way would be to detect and delete those outliers by finding the generated points that exceed some cutoff. Of course this would harm the "randomness" in your generated data but your request for outlier-free data implies that by definition. Possibly, decreasing the variance of X could also help.
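A minimal sketch of the detect-and-delete idea, assuming B0 = 2 and B1 = 0.5 (the question does not give values for them) and a fixed seed. Since the 4/n cutoff moves as points are removed, the fit is repeated until nothing is flagged; the lower bound of 10 points is an arbitrary safety guard, not part of any standard procedure.

```r
set.seed(2024)
n  <- 50
B0 <- 2     # assumed intercept, not given in the question
B1 <- 0.5   # assumed slope, not given in the question

X <- rnorm(n, 5, 3)
E <- rnorm(n, 0, 1)
dat <- data.frame(X = X, Y = B0 + B1 * X + E)

fit <- lm(Y ~ X, data = dat)
cd  <- cooks.distance(fit)

# Drop flagged points and refit until none exceed 4/n (n shrinks as we go);
# the size guard stops the loop from trimming the sample away entirely
while (any(cd > 4 / nrow(dat)) && nrow(dat) > 10) {
  dat <- dat[cd <= 4 / nrow(dat), ]
  fit <- lm(Y ~ X, data = dat)
  cd  <- cooks.distance(fit)
}

nrow(dat)                  # number of points that survived the trimming
all(cd <= 4 / nrow(dat))   # TRUE unless the size guard ended the loop
```

The surviving sample is smaller than 50 and no longer a plain normal sample, which is the loss of "randomness" mentioned above. If exactly 50 points are needed, one option is to start with more than 50 and trim down.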
Is there a particular reason you need the X's to be normally distributed? The assumption of normality in regression is for the residuals (the error term). Typically the measured independent variable won't be normally distributed -- in a balanced, (quasi-)experimental setup, the X's should be close to uniformly distributed. A uniform distribution for the X's (or even an evenly divided sequence generated with seq()) would help you here because the "outlierness" of outliers arises from being both far from the center of the sample space and comparatively few in number. With a uniform distribution, they are no longer few in number, which reduces their leverage.
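The leverage part of this argument can be checked numerically. The sketch below compares the largest hat value for normally distributed X's with the one for an evenly spaced design of the same size; the range of the sequence, the number of repetitions and the helper `max_leverage` are choices made for this illustration. It looks at leverage only, so it does not guarantee that every Cook's distance falls below 4/n, since that also depends on the residuals.

```r
n <- 50

# Largest hat value (leverage) for a simple regression on a given design
max_leverage <- function(x) {
  y <- rnorm(length(x))          # the response does not affect leverage
  max(hatvalues(lm(y ~ x)))
}

set.seed(7)
# Normally distributed X as in the question, repeated to average out luck
lev_normal <- replicate(200, max_leverage(rnorm(n, 5, 3)))
# Evenly spaced X over an arbitrary range; leverage is scale-invariant
lev_seq    <- max_leverage(seq(-4, 14, length.out = n))

mean(lev_normal)   # typical largest leverage with normal X
lev_seq            # largest leverage with evenly spaced X
```

For an evenly spaced design the largest leverage sits at the two end points and equals 1/n + 3(n - 1) / (n(n + 1)), which is about 0.078 for n = 50. A normal sample usually contains one or two points much further from the mean, so its largest leverage tends to be higher.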
As a sidebar: real data has outliers. This is actually one of the ways we can detect touched-up or even faked data in science. If you're interested in simulations that correspond to something in reality, then outliers may not be a bad thing. And there is a whole world of robust methods for dealing with data with arbitrarily bad outliers in a principled way, as opposed to arbitrary cutoff points.

Is there an implementation of loess in R with more than 3 parametric predictors or a trick to a similar effect?

Calling all experts on local regression and/or R!
I have run into a limitation of the standard loess function in R and hope you have some advice. The current implementation supports only 1-4 predictors. Let me set out our application scenario to show why this can easily become a problem as soon as we want to employ globally fit parametric covariables.
Essentially, we have a spatial distortion s(x,y) overlaid over a number of measurements z:
z_i = s(x_i,y_i) + v_{g_i}
These measurements z can be grouped by the same underlying undistorted measurement value v for each group g. The group membership g_i is known for each measurement, but the underlying undistorted measurement values v_g for the groups are not known and should be determined by (global, not local) regression.
We need to estimate the two-dimensional spatial trend s(x,y), which we then want to remove. In our application, say there are 20 groups of at least 35 measurements each, in the most simple scenario. The measurements are randomly placed. Taking the first group as reference, there are thus 19 unknown offsets.
The below code for toy data (with a spatial trend in one dimension x) works for two or three offset groups.
Unfortunately, the loess call fails for four or more offset groups with the error message
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square,
normalize, :
only 1-4 predictors are allowed
I tried overriding the restriction and got
k>d2MAX in ehg136. Need to recompile with increased dimensions.
How easy would that be to do? I cannot find a definition of d2MAX anywhere, and it seems this might be hardcoded -- the error is apparently triggered by line #1359 in loessf.f
if(k .gt. 15) call ehg182(105)
Alternatively, does anyone know of an implementation of local regression with global (parametric) offset groups that could be applied here?
Or is there a better way of dealing with this? I tried lme with correlation structures but that seems to be much, much slower.
Any comments would be greatly appreciated!
Many thanks,
David
###
#
# loess with parametric offsets - toy data demo
#
x<-seq(0,9,.1);
x.N<-length(x);
o<-c(0.4,-0.8,1.2#,-0.2 # works for three but not four
); # these are the (unknown) offsets
o.N<-length(o);
f<-sapply(seq(o.N),
function(n){
ifelse((seq(x.N)<= n *x.N/(o.N+1) &
seq(x.N)> (n-1)*x.N/(o.N+1)),
1,0);
});
f<-f[sample(NROW(f)),];
y<-sin(x)+rnorm(length(x),0,.1)+f%*%o;
s.fs<-sapply(seq(NCOL(f)),function(i){paste('f',i,sep='')});
s<-paste(c('y~x',s.fs),collapse='+');
d<-data.frame(x,y,f)
names(d)<-c('x','y',s.fs);
l<-loess(formula(s),parametric=s.fs,drop.square=s.fs,normalize=F,data=d,
span=0.4);
yp<-predict(l,newdata=d);
plot(x,y,pch='+',ylim=c(-3,3),col='red'); # input data
points(x,yp,pch='o',col='blue'); # fit of that
d0<-d; d0$f1<-d0$f2<-d0$f3<-0;
yp0<-predict(l,newdata=d0);
points(x,y-f%*%o); # spatial distortion
lines(x,yp0,pch='+'); # estimate of that
op<-sapply(seq(NCOL(f)),function(i){(yp-yp0)[!!f[,i]][1]});
cat("Demo offsets:",o,"\n");
cat("Estimated offsets:",format(op,digits=1),"\n");
Why don't you use an additive model for this? Package mgcv will handle this sort of model just fine, if I understand your Question. I might have this wrong, but the code you show is relating y ~ x, while your Question mentions z ~ s(x, y) + g. What I show below for gam() is for response z modelled by a spatial smooth in x and y with g being estimated parametrically, with g stored as a factor in the data frame:
require(mgcv)
m <- gam(z ~ s(x,y) + g, data = foo)
Or have I misunderstood what you wanted? If you want to post a small snippet of data I can give a proper example using mgcv...?
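In the meantime, here is a sketch on simulated data shaped like the scenario in the Question: 20 groups of 35 randomly placed measurements. The distortion function, the noise level, the seed and the names `foo`, `v` and `distortion` are all assumptions made for this illustration.

```r
library(mgcv)

set.seed(99)
n_groups  <- 20
per_group <- 35
N <- n_groups * per_group

# Assumed toy data: random positions, a smooth spatial distortion,
# and one unknown offset v per group
foo <- data.frame(
  x = runif(N, 0, 9),
  y = runif(N, 0, 9),
  g = factor(rep(seq_len(n_groups), each = per_group))
)
v <- rnorm(n_groups)
distortion <- sin(foo$x / 2) + 0.5 * cos(foo$y / 3)
foo$z <- distortion + v[as.integer(foo$g)] + rnorm(N, sd = 0.1)

# Spatial smooth plus a parametric term for the groups
m <- gam(z ~ s(x, y) + g, data = foo)

# Group 1 is the reference level, so the g coefficients are offsets
# relative to it: 19 estimates for 20 groups
est          <- coef(m)[grep("^g", names(coef(m)))]
true_offsets <- v[-1] - v[1]
round(cbind(true = true_offsets, estimated = unname(est)), 2)

# Estimated (centred) spatial trend, which can then be subtracted from z
trend <- predict(m, type = "terms")[, "s(x,y)"]
```

The smooth is fitted with a sum-to-zero constraint, so `trend` matches the simulated distortion only up to an additive constant. If the real distortion is wigglier than this toy one, the basis dimension `k` in `s(x, y, k = ...)` may need to be raised.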
