Inverse transformation from correlation coefficients to initial values in R

Let's work with the classic iris dataset:
data(iris)
When I conduct a Pearson correlation analysis, I get these correlation coefficients:
SEPALLEN SEPALWID PETALLEN PETALWID
SEPALLEN  1.000000 -0.117570  0.871754  0.817941
SEPALWID -0.117570  1.000000 -0.428440 -0.366126
PETALLEN  0.871754 -0.428440  1.000000  0.962865
PETALWID  0.817941 -0.366126  0.962865  1.000000
So is there a way to perform the inverse transformation, namely to recover the initial values of the variables from the correlation coefficients?

You cannot recover the underlying data from a correlation matrix, only the general character of the correlation between two columns. If Pearson's coefficient is positive, there is an increasing tendency; if negative, a decreasing one. We can visualize it with a correlation plot:
data(iris)
library(PerformanceAnalytics)
chart.Correlation(iris[, 1:4], histogram=TRUE, pch=19)
As you can see, each number in the upper triangle matches a graph in the lower triangle. In fact the cor function transforms the 600 entries in iris (columns 1-4) into just 6 unique numbers, so an unambiguous inverse transformation from 6 numbers back to 600 numbers is not possible.
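To see why this mapping cannot be inverted, note that cor() is invariant under positive linear rescaling, so very different datasets share the same correlation matrix. A minimal sketch (the shifts and scale factors are arbitrary):

```r
data(iris)
x <- iris[, 1:4]
# Shift and rescale each column: the raw values change completely...
y <- data.frame(a = 2 * x[, 1] + 10, b = 5 * x[, 2] - 3,
                c = 0.1 * x[, 3], d = x[, 4] + 100)
# ...but the correlation matrix is identical
all.equal(unname(cor(x)), unname(cor(y)))
```

Since infinitely many datasets map to the same correlation matrix, no unique inverse exists.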

Related

Applying PCA to a covariance matrix

I am having some difficulty understanding some steps in a procedure. In short, they take coordinate data, find the covariance matrix, apply PCA, then extract the standard deviations as the square roots of the eigenvalues. I am trying to reproduce this process, but I am stuck on some of the steps.
The Steps Taken
The data set consists of one matrix, R, that contains coordinate pairs (x(i), y(i)) with i = 1, ..., N, where N is the total number of instances recorded. We applied PCA to the covariance matrix of the R input data set, and the following variables were obtained:
a) the principal components of the new coordinate system, the eigenvectors u and v, and
b) the eigenvalues (λ1 and λ2) corresponding to the total variability explained by each principal component.
With these variables, a graphical representation was created for each item. Two orthogonal segments were centred on the mean of the coordinate data. The segments’ directions were driven by the eigenvectors of the PCA, and the length of each segment was defined as one standard deviation (σ1 and σ2) around the mean, which was calculated by extracting the square root of each eigenvalue, λ1 and λ2.
My Steps
#reproducible data
set.seed(1)
x<-rnorm(10,50,4)
y<-rnorm(10,50,7)
# Note my data is not perfectly distributed in this fashion
df<-data.frame(x,y) # this is my R matrix
covar.df<-cov(df,use="all.obs",method='pearson') # this is my covariance matrix
pca.results<-prcomp(covar.df) # this applies PCA to the covariance matrix
pca.results$sdev # these are the standard deviations of the principal components
# which is what I believe I am looking for.
This is where I am stuck, because I am not sure whether I am after the sdev output from prcomp() or whether I should scale my data first. The variables are all on the same scale, so I do not see a problem with leaving them unscaled.
My second question is: how do I extract the standard deviation in the x and y directions?
You don't apply prcomp to the covariance matrix; you apply it to the data itself.
result <- prcomp(df)
If by scaling you mean normalizing or standardizing, that happens before you call prcomp(). For more information, see this introductory link: pca on R. It can walk you through the basics. To get the sdev, use summary() on the result object:
summary(result)
result$sdev
You don't apply prcomp to the covariance matrix. With scale=TRUE the PCA is based on the correlation matrix, and with scale=FALSE on the covariance matrix:
df.cor = prcomp(df, scale=TRUE)
df.cov = prcomp(df, scale=FALSE)
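To reproduce the graphical step described in the question (a sketch; the data are simulated and the variable names are illustrative), run PCA on the data itself and draw two orthogonal segments of length one standard deviation along the principal axes:

```r
set.seed(1)
df <- data.frame(x = rnorm(10, 50, 4), y = rnorm(10, 50, 7))

pca <- prcomp(df)        # PCA on the data, not on cov(df)
ctr <- colMeans(df)      # centre of the coordinate data
dirs <- pca$rotation     # eigenvectors u and v (as columns)
sds <- pca$sdev          # sigma_i = sqrt(lambda_i)

plot(df, asp = 1)
for (i in 1:2) {
  segments(ctr[1] - sds[i] * dirs[1, i], ctr[2] - sds[i] * dirs[2, i],
           ctr[1] + sds[i] * dirs[1, i], ctr[2] + sds[i] * dirs[2, i])
}
```

Here pca$sdev equals the square roots of the eigenvalues of cov(df), which also answers the second question: sds[1] and sds[2] are the standard deviations along the two principal directions, not along the raw x and y axes.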

Correlation between 2 rasters accounting for spatial autocorrelation

I want to test the correlation in the values between 2 spatial raster data sets (that perfectly overlap).
I could just do:
cor(getValues(raster1), getValues(raster2))
but both raster datasets are spatially autocorrelated.
Instead, I am using:
modified.ttest(getValues(raster1), getValues(raster2), coordinates)
from the SpatialPack library.
This is based on Dutilleul's test, which modifies the effective sample size based on the degree of autocorrelation.
However, the modified test does not change the estimated correlation coefficient, only the p-value.
How do I also correct the estimated correlation coefficient for the extent of autocorrelation?
This is more a stats than a programming question.
I do not think you can "correct the correlation coefficient for autocorrelation". The correlation coefficient is what it is. It is not affected by "oversampling".
a <- 1:10
b <- c(1:5,1:5)
cor(a,b)
#[1] 0.492366
There is no "inflation" when using the same values twice:
cor(c(a,a),c(b,b))
#[1] 0.492366
The p-value, however, is affected:
t.test(a,b)$p.value
#[1] 0.03554967
t.test(c(a,a), c(b,b))$p.value
#[1] 0.002042504
You can adjust the p-value for oversampling. However, a question with raster data is whether you should indeed consider these as a sample. That depends on context, but raster data often represent the entire population (with some local averaging, given that cells are discrete). If there is no uncertainty due to (a small) sample size, presenting a p-value is not meaningful.
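For completeness, a minimal sketch of the SpatialPack workflow on toy data (the coordinates and variables are made up, and the accessors corr, ESS and p.value are assumed to match the components of the object the package documents):

```r
library(SpatialPack)
set.seed(1)
coords <- cbind(runif(100), runif(100))    # toy spatial coordinates
z1 <- coords[, 1] + rnorm(100, sd = 0.1)   # two variables sharing a spatial trend
z2 <- coords[, 1] + rnorm(100, sd = 0.1)
res <- modified.ttest(z1, z2, coords)
res$corr     # same Pearson coefficient as cor(z1, z2)
res$ESS      # effective sample size, reduced for autocorrelation
res$p.value  # p-value based on the effective sample size
```

This illustrates the point of the answer: the coefficient itself is untouched; only the effective sample size, and hence the p-value, changes.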

Several regressions: modify the code

When I run the code below, I can calculate the regression coefficients for each category of c. Now I am wondering how I can apply these estimated coefficients to calculate the residuals of all observations. For example, here just 25 observations belong to c=1, but I need to calculate the fitted values/residuals of all 50 observations based on the estimated coefficients for this category.
A<-cars$speed
B<-cars$dist
c<-rep(1:2,25)
S<-data.frame(A,B,c)
library(plyr)
lmodel <- dlply(S,"c", function(d) lm(B~A, data = d))
I'm not 100% sure I understand what you mean, but the following code will give you a list of residuals. The first element of the list contains the residuals for all 50 observations using the coefficients for c=1 and the second for c=2.
residuals <- lapply(lmodel, function(x) B - coef(x)[1] - coef(x)[2] * A)
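Equivalently (a sketch reusing the objects from the question), predict() applies each per-category model to all 50 observations, avoiding writing out the coefficients by hand:

```r
fitted_all <- lapply(lmodel, function(m) predict(m, newdata = data.frame(A = A)))
residuals_all <- lapply(fitted_all, function(f) B - f)
```

Each element of residuals_all holds the residuals of all 50 observations under one category's model, and this form generalizes automatically to models with more than two coefficients.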

How to extract values fitted to a gaussian distribution in R?

I have a data frame x with 2 columns, a and b; a is of class character and b is of class numeric.
I fitted a gaussian distribution using the fitdist (fitdistrplus package) function on b.
data.fit <- fitdist(x$b,"norm", "mle")
I want to extract the elements in column a that fall in the 5% right tail of the fitted gaussian distribution.
I am not sure how to proceed because my knowledge on fitting distribution is limited.
Do I need to retain the corresponding elements in column a for which b is greater than the value obtained for the 95th percentile?
Or does the fitting imply that new values have been created for each value in b and I should use those values?
Thanks
By calling unclass(data.fit) you can see all the parts that make up the data.fit object, which include:
$estimate
     mean        sd
0.1125554 1.2724377
which means you can access the estimated mean and standard deviation via:
data.fit$estimate['sd']
data.fit$estimate['mean']
To calculate the 95th percentile of the fitted distribution, i.e. the cutoff for the 5% right tail, you can use the qnorm() function (q is for quantile, BTW) like so:
threshold <- qnorm(p = 0.95,
                   mean = data.fit$estimate['mean'],
                   sd = data.fit$estimate['sd'])
and you can subset your data.frame x like so:
x[x$b > threshold,  # an indicator of the rows to return
  'a']              # the column to return
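Putting it together on simulated data (a sketch; the data frame and its columns are stand-ins for the question's objects):

```r
library(fitdistrplus)
set.seed(1)
x <- data.frame(a = letters[1:26], b = rnorm(26), stringsAsFactors = FALSE)
data.fit <- fitdist(x$b, "norm", "mle")
threshold <- qnorm(0.95, mean = data.fit$estimate["mean"],
                   sd = data.fit$estimate["sd"])
x[x$b > threshold, "a"]  # elements of a whose b falls in the 5% right tail
```

Note that fitting does not create new values for b; it only estimates the mean and sd of the normal distribution, so subsetting on the original b values is the right approach.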

Mahalanobis distance based classifier leads to seemingly wrong scores for points identical to training data

I have been using the mahal classifier function (dismo package in R) in several of my analyses, and I have recently discovered that it seems to give apparently wrong distance results for points that are identical to points used in training the classifier. For background: from what I understand of Mahalanobis-based classifiers, they use the Mahalanobis distance to describe the similarity of an unclassified point by measuring that point's distance from the centre of mass of the training set (while accounting for differences in scale and covariance, etc.). The Mahalanobis distance score varies from -inf to 1, where 1 indicates no distance between the unclassified point and the centroid defined by the training set. However, I found that for all points with predictor values identical to those of the training points, I get a score of exactly 1, as if the routine were working as a nearest-neighbour classifier. This is very troubling behaviour because it has the potential to artificially inflate the confidence of my overall classification.
Has anyone encountered this behavior? Any ideas on how to fix/ avoid this behavior?
I have written a small script below that showcases the odd behavior clearly:
rm(list = ls()) #remove all past worksheet variables
library(dismo)
logo <- stack(system.file("external/rlogo.grd", package="raster"))
#presence data (points that fall within the 'r' in the R logo)
pts <- matrix(c(48.243420, 48.243420, 47.985820, 52.880230, 49.531423, 46.182616,
54.168232, 69.624263, 83.792291, 85.337894, 74.261072, 83.792291, 95.126713,
84.565092, 66.275456, 41.803408, 25.832176, 3.936132, 18.876962, 17.331359,
7.048974, 13.648543, 26.093446, 28.544714, 39.104026, 44.572240, 51.171810,
56.262906, 46.269272, 38.161230, 30.618865, 21.945145, 34.390047, 59.656971,
69.839163, 73.233228, 63.239594, 45.892154, 43.252326, 28.356155), ncol=2)
# fit model
m <- mahal(logo, pts)
#using model, predict train data
training_vals=extract(logo, pts)
x <- predict(m, training_vals)
x #results show a perfect 1 prediction, which is highly unlikely
Now, I try to make predictions for values that are the average of directly adjacent point pairs. I do this because, given that:
(1) each point in each pair used to train the model has a perfect suitability, and
(2) at least some of these average points are likely to be at least as close to the centre of the Mahalanobis centroid as the original pairs,
(3) I would expect at least a few of the average points to have a perfect suitability as well.
#create adjacent points by shifting each original point up by one unit
adjacent_pts=pts
adjacent_pts[,2]=adjacent_pts[,2]+1
adjacent_training_vals=extract(logo, adjacent_pts)
new_pts=rbind(pts, adjacent_pts)
plot(logo[[1]]) #plot predictor raster and response point pairs
points(new_pts[,1],new_pts[,2])
#use model to predict mahalanobis score for new training data (point pairs)
m <- mahal(logo, new_pts)
new_training_vals=extract(logo, new_pts)
x <- predict(m, new_training_vals)
x
As expected from the odd behaviour described, all training points have a distance score of 1. However, let's try to predict points that are an average of each pair:
mid_vals=(adjacent_training_vals+training_vals)/2
x <- predict(m, mid_vals)
x #NONE DO!
For me this is further indication that the mahal routine will give a perfect score for any data point that has values equal to any of the points used to train the model.
The following is unnecessary, but it is just another way to prove the point: here I predict the same original training data with a near-insignificant 'budge' of the values for only one of the predictors, and show that the resulting scores change quite significantly.
mod_training_vals=training_vals
mod_training_vals[,1]=mod_training_vals[,1]*1.01
x <- predict(m, mod_training_vals)
x #predictions suddenly are far from perfect predictions
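As a cross-check of the centroid-based behaviour the question expects (a sketch using base R's stats::mahalanobis on toy data, not dismo's mahal), the distance of a training point to the centre of mass of its own training set is generally not zero, so it should not automatically map to a perfect score:

```r
set.seed(1)
train <- matrix(rnorm(100), ncol = 2)   # toy training data
# squared Mahalanobis distance of each training point to the centroid
d <- mahalanobis(train, center = colMeans(train), cov = cov(train))
d[1]   # typically well above 0 for an individual training point
```

If a classifier returns a perfect score for every training point regardless of its distance to the centroid, that is consistent with the nearest-neighbour-like behaviour described above rather than a centroid-based Mahalanobis score.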
