Creating a correlation matrix using a novel distance function (R)

I'm trying to write a function that will create a correlation matrix using a fancy dependence estimate (dcor, the Brownian distance correlation). More generally, I want to write code for a generic "correlation" matrix into which you can plug any distance estimator.
My data is formatted such that columns are variables and rows are observations.
I'm having problems with my basic code. My algorithm is as follows:
Use apply to take a variable (column)
Pass it to a function that again calls apply over the entire matrix
At this point you have a pair of variables
Use na.omit to remove missing observations (necessary for dcor)
Calculate dcor
I was hoping this would produce the correlation matrix, but I'm having a lot of problems with basic variable management. In particular, I'm having difficulty passing variables to the apply function: I want to take the column pulled out by the first apply and pass it to the second apply (which runs over the entire original matrix).
My code:
dcormatrix <- function(Matrix){
  dcorhelper <- function(Col1){
    as.matrix(apply(Matrix, 2, function(Col2){
      B <- na.omit(cbind(Col1, Col2))
      dcor(B[,1], B[,2], index=1)
    }, Col1=Col1))
  }
  apply(Matrix, 2, dcorhelper(), Matrix=Matrix)
}
Any ideas? I'm sure there's gotta be an easy way to do this.

You may want to check out designdist from the vegan package, which allows one to define alternative distance / dissimilarity measures.
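For the code in the question, a minimal corrected sketch (assuming dcor from the energy package): pass the helper itself to apply instead of calling it, and let the inner closure capture Col1 rather than forwarding it as an extra argument.

library(energy)  # provides dcor()

dcormatrix <- function(Matrix) {
  dcorhelper <- function(Col1) {
    apply(Matrix, 2, function(Col2) {
      B <- na.omit(cbind(Col1, Col2))  # drop rows missing in either column
      dcor(B[, 1], B[, 2], index = 1)
    })
  }
  apply(Matrix, 2, dcorhelper)  # outer loop over columns
}

The result is an ncol(Matrix) x ncol(Matrix) matrix; swapping dcor() for any other pairwise estimator gives the generic "correlation" matrix described above.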

Related

Calculating just a single row of dissimilarity/distance matrix

I have a data frame with 30k rows and 10 features, and I would like to calculate a distance matrix like below:
gower_dist <- daisy(dataframe, metric = "gower")
This function returns the whole dissimilarity matrix, but I want just the first row (the distances of the first element of the data frame to all others). How can I do it? Do you have an idea?
You probably need to get the source and extend it.
I suggest you extend the API by adding a second parameter y that defaults to x. Then the method should return the pairwise distances of each element in x to each element in y.
Fortunately, R is GPL open source, so this is easy.
This would likely be a welcome extension; consider submitting it to the package authors for inclusion.
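If patching the package isn't an option, the single-row case can also be computed directly. A minimal sketch for purely numeric data (gower_to_row is a hypothetical helper; daisy's factor, ordered, and NA handling are omitted):

library(cluster)  # daisy, used here only to check the result

# Gower distance from row i to every row of a numeric data frame,
# without building the full 30k x 30k matrix
gower_to_row <- function(df, i = 1) {
  x <- as.matrix(df)
  rng <- apply(x, 2, function(col) diff(range(col, na.rm = TRUE)))
  scaled <- sweep(x, 2, rng, "/")                    # scale each feature by its range
  rowMeans(abs(sweep(scaled, 2, scaled[i, ], "-")))  # average absolute difference
}

df <- data.frame(a = rnorm(100), b = runif(100))
all.equal(unname(gower_to_row(df, 1)[-1]),
          unname(as.matrix(daisy(df, metric = "gower"))[1, -1]))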

How to calculate distance from center of a hypersphere (whitened PCA)

Let's say that I have calculated the principal components of a reference data set, including whitening. The transformation matrix built from the principal component vectors is then applied to a test data set, projecting it onto the PC subspace. Now I should be able to measure the distance of each test data vector from the center of the PC hypersphere by simply adding up the coefficients of each column. Is this correct? Applying this transformation to my reference data gives a length of zero for all columns, and the length of the vectors seems to decrease as I make the test data look more like the reference data and to grow as I make the two sets more distinct.
Am I correct that I can judge "distance" in a multidimensional space in this way? Is it just the sum of the coefficients of the projected matrix?
Thanks very much for any insight you can provide.
Distance is not a linear sum, and it is never zero (except at the origin). It is computed as:
distance(x) = sqrt( sum_i( x_i^2 ) )
If this is not what you were looking for, please expand your question and include some code and examples.
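In R terms, a minimal sketch with hypothetical data: for an n x k matrix of projected coefficients, the distance of each test vector from the center is the row-wise Euclidean norm, not a plain sum (a sum of coefficients can cancel to zero even far from the origin):

scores <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)  # hypothetical n x k PC coefficients
dist_from_center <- sqrt(rowSums(scores^2))           # Euclidean norm of each row
head(dist_from_center)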

Cortest.mat converting to class "psych,sim"

I've been using the psych package to compare two correlation matrices with the function cortest.
Now I want to try the cortest.mat and cortest.jennrich functions, which require an object of class "psych, sim". I have tried converting my correlation matrices with sim.structure, which results in an object of those classes, but I get an error when running either function.
Here is what I've tried, using random numbers:
Random <- cor(matrix(rnorm(400, 0, .25), nrow = 20, ncol = 20))
SimRandom <- sim.structure(Random)
class(SimRandom)
cortest.jennrich(SimRandom, SimRandom, n1 = 400, n2 = 400)
Yields the following:
Error in if (dim(R1)[1] != p) { : argument is of length zero
I'm sure I'm doing it wrong, because of the error message and because the values in Random and SimRandom are not exactly the same.
What is the correct way to convert a correlation matrix to a "psych, sim" object to use as input for cortest.mat?
Thanks in advance.
EDIT: A short explanation of what I want to do; using random numbers above serves just as an example. The actual correlation matrices to compare are built as follows. I have a huge list of files, each composed of 100 observations for a specific genetic location. These files can be grouped into, say, 20 files based on known genetic relationships, so I take those groups of files, load them into a matrix as columns, and calculate cor(). That gives a correlation matrix. As a control, I load random files and treat them the same way; this matrix contains real data, but the grouping is done randomly. In the end I have two correlation matrices: 1) the correlations of pre-selected files, and 2) the correlations between randomly loaded files. Both matrices are the same size.
What I would like to do is to compare the two correlation matrices to have an idea whether the grouping has an influence on the correlation values observed.
Sorry for not explaining this earlier, I wanted to avoid the long explanation and keep the question simple.
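For what it's worth, the error suggests dim(R1) is NULL, which happens because a sim.structure result is a list rather than a matrix. A hedged sketch, assuming cortest.mat and cortest.jennrich accept plain correlation matrices as R1/R2 (as cortest does), with stand-in data mimicking the two 20-column groupings:

library(psych)

# Stand-in data: 100 observations x 20 variables per grouping
grouped <- matrix(rnorm(2000), nrow = 100, ncol = 20)
random  <- matrix(rnorm(2000), nrow = 100, ncol = 20)

R1 <- cor(grouped)
R2 <- cor(random)

# n1/n2 are the numbers of observations behind each correlation matrix
cortest.jennrich(R1, R2, n1 = 100, n2 = 100)
cortest.mat(R1, R2, n1 = 100, n2 = 100)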

Taylor diagram from existing Correlation and Standard Dev values

Is it possible to create a Taylor diagram from already calculated correlation and standard deviation values?
I am doing model evaluation and already have the correlation and standard deviation values. I understand that the plotrix package can create the diagram when given the observed and modeled values; however, for the type of work I am doing, it is easier to start from the already-computed correlation and standard deviation values.
Is there any way I can do this in R?
There's no reason it shouldn't be possible, but the authors didn't allow for that when they wrote the function. The function is a bit long and complex, but the part that does the calculation is at the top, and it is possible to swap that code out to allow passing summary statistics instead. Keep in mind that what I'm about to do is a hack, and I've only tested it with version 3.5-5 of plotrix; other versions may not work.
Here we will create a new function, taylor.diagram2, that takes all the code from taylor.diagram but adds an extra if statement to check for a list of summarized data as the first argument:
taylor.diagram2 <- taylor.diagram
bl <- as.list(body(taylor.diagram))
cond <- list(
  as.name("if"),
  quote(is.list(ref) & missing(model)),                    # condition
  quote({R <- ref$R; sd.r <- ref$sd.r; sd.f <- ref$sd.f}), # if true
  as.call(c(as.symbol("{"), bl[3:8])))                     # else
bl <- c(bl[1:2], as.call(cond), bl[9:length(bl)])          # splice in new code
body(taylor.diagram2) <- as.call(bl)                       # update the function
Now we can test the function. First, we'll do things the standard way
# test data
aref <- rnorm(30, sd = 2)
amodel1 <- aref + rnorm(30)/2

# standard behavior
taylor.diagram2(aref, amodel1, main = "Standard Behavior")

# summarized data
xx <- list(
  R = cor(aref, amodel1, use = "pairwise"),
  sd.r = sd(aref),
  sd.f = sd(amodel1)
)

# modified behavior
taylor.diagram2(xx, main = "Modified Behavior")
So the new taylor.diagram2 function can do both. If you pass it two vectors, it does the standard thing; if you pass it a list with the names R, sd.r, and sd.f, it draws the same plot using the values you passed in. Note that the model parameter must be missing for the modified version to trigger, so any additional parameters must be passed by name rather than by position.
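For instance, extra plotting options go in by name while the model slot stays empty (col and pch are standard taylor.diagram parameters):

taylor.diagram2(xx, main = "Modified Behavior", col = "blue", pch = 17)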

Combining ROCR performance objects

I have multiple performance objects created using ROCR. Each of these contains AUC or FPR/TPR values for a class, and in turn holds results for multiple test runs. So,
length(first.perf.obj@y.values)
gives something > 1.
I can plot average for a single class using
plot(first.perf.obj, avg="vertical")
as described in the ROCR manual. I want to combine these objects to calculate and plot their global average. Something like
global.perf.obj <- combine.perf.objects(first.perf.obj, second.perf.obj, third.perf.obj)
Is there an easy way to do this, or should I decompose each object and calculate values by hand?
I went back and recreated prediction objects for the global case, calling the prediction function like this:
global.prediction <- prediction(
  c(cls1.likelihood,
    cls2.likelihood,
    cls3.likelihood,
    cls4.likelihood,
    cls5.likelihood),
  c(duplicate.cols(cls1.labels, ncol(cls1.likelihood)),
    duplicate.cols(cls2.labels, ncol(cls2.likelihood)),
    duplicate.cols(cls3.labels, ncol(cls3.likelihood)),
    duplicate.cols(cls4.labels, ncol(cls4.likelihood)),
    duplicate.cols(cls5.labels, ncol(cls5.likelihood))),
  label.ordering = c(FALSE, TRUE))
where duplicate.cols simply builds a data.frame of repeated labels.
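That helper isn't shown in the post; a hypothetical sketch of it might be:

# Hypothetical helper: a data.frame holding n copies of the label vector,
# one column per test run
duplicate.cols <- function(labels, n) {
  as.data.frame(matrix(rep(labels, n), ncol = n))
}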
Then I can get any statistic for the global case with, e.g., performance(global.prediction, "auc").
It's a bit slow, but I think it's simpler than trying to combine values from multiple performance objects.
