R: Compute only a band of a correlation matrix - r

I would like to compute correlations between columns of a matrix only for some band of the correlations matrix. I know how to get the whole correlation matrix:
X <- matrix(rnorm(20*30), nrow=20)
cor(X)
But as shown in the left figure below, I'm only interested in some band below the main diagonal.
I could try to cleverly subset the original matrix to get only the little squares shown in the right figure, but this seems cumbersome.
Do you have a better idea/solution to the problem?
EDIT
I forgot to mention this, but I can hardly use a for loop in R, since the dimension of the correlation matrix is rather large (about 2000*2000) and I have to do this process around 100 times.

You’re probably right that cor on the whole matrix is faster than using manual loops, since the internal workings of cor are highly optimised for matrices. But the bigger the matrix (and, conversely, the smaller the band), the more benefit you could reap from manually looping over the band.
That said, maybe just give it a try – the code for the manual loop is trivial:
cor_band = function (x, band_width, method = c('pearson', 'kendall', 'spearman')) {
out = matrix(nrow = ncol(x), ncol = ncol(x))
for (i in 1 : ncol(x))
for (j in i : min(i + band_width, ncol(x)))
out[j, i] = cor(x[, j], x[, i], method = method)
out
}
Note that the indices in out are reversed so that we get the band below the diagonal rather than above. Since the correlation matrix is symmetrical, either works.

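Try a for loop:
band_width = 3 # example value: number of sub-diagonals to fill
band_cor_mat = matrix(NA, nrow=ncol(X), ncol=ncol(X)) # one row and column per column of X
for (cc in 1:ncol(X)) { # Diagonal
  for (mm in seq_len(min(band_width, ncol(X)-cc))) { # Band; empty for the last column
    band_cor_mat[cc+mm,cc] = cor(X[,cc+mm], X[,cc])
  }
}
You will have a correlation matrix, with correlation values in the band below the diagonal, and NAs for the rest.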
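Since the question rules out calling cor once per pair for a 2000-column matrix, here is a minimal sketch of a mostly vectorized alternative for Pearson correlations only. It standardizes the columns once and then fills one sub-diagonal at a time, so the loop runs band_width times instead of once per cell. The name cor_band_fast is made up for this illustration, and I have not benchmarked it against cor on the full matrix.

```r
# Sketch: band of a Pearson correlation matrix, one sub-diagonal per iteration.
# cor_band_fast is a hypothetical helper name, not from any package.
cor_band_fast <- function(x, band_width) {
  p <- ncol(x)
  z <- scale(x)                          # center and scale columns (sd with n - 1)
  out <- matrix(NA_real_, p, p)
  for (d in seq_len(min(band_width, p - 1))) {
    i <- seq_len(p - d)                  # left-hand column of each pair at lag d
    r <- colSums(z[, i, drop = FALSE] * z[, i + d, drop = FALSE]) / (nrow(x) - 1)
    out[cbind(i + d, i)] <- r            # fill the d-th sub-diagonal
  }
  out
}

set.seed(1)
X <- matrix(rnorm(20 * 30), nrow = 20)
band <- cor_band_fast(X, band_width = 3)
full <- cor(X)
in_band <- !is.na(band)
all.equal(band[in_band], full[in_band])  # compare against the full matrix
```

The diagonal is left as NA here because it is always 1; Kendall or Spearman would need a different approach.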


Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbor and target.
I prepared the data first and named it wdbc_n. I chose K=1 and used Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k,method){
n <- nrow(xmat)
if (n <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n, ncol = k)
for(i in 1:n) {
ddist<- distance(xmat, method)
neigh[i, ] <- order(ddist)[2:(k + 1)]
}
return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
Hoping to get a similar result to the paper ("on the surprising behavior of distance metrics in high dimensional space") (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin)(wdbc) dimension is
569 32
After normalizing and removing the id and target column the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each iteration feeds the whole xmat matrix to the distance function, then uses only the first line from the resulting matrix. This calculates the distance between the training set rows, and does that n times, discarding every row except the first, which is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit =
"log", est.prob = NULL)
x a numeric data.frame or matrix (storing probability vectors) or a
numeric data.frame or matrix storing counts (if est.prob is
specified).
(...)
in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 :
a distance matrix storing distance values for all pairwise probability
vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which returns the positions (indices) of the elements in increasing order of distance, not the distances themselves. I think what you want is the distances themselves, so you need to use sort instead of order.
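To make the difference between order and sort concrete, here is a tiny example on a made-up distance vector (the numbers are only for illustration):

```r
# Made-up distances from one target to four candidate neighbours
d <- c(0.7, 0.2, 0.9, 0.4)

nearest_idx  <- order(d)[1:2]  # positions of the two smallest distances: 2 4
nearest_dist <- sort(d)[1:2]   # the two smallest distances themselves: 0.2 0.4
```

If you need the class labels of the neighbours, the indices from order are what you look up; if you only want to compare distances, sort is enough.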
Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
n.test <- nrow(test)
n.train <- nrow(train)
if (n.train + n.test <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n.test, ncol = k)
ddist <- NULL
for(i in 1:n.test) {
for(j in 1:n.train) {
xmat <- rbind(test[i,], train[j,]) #we make a 2 row matrix combining the current test and train rows
ddist[j] <- distance(as.data.frame(xmat), method = method) #then we calculate the distance and store it at position j of ddist (k is not passed: the third argument of distance is p, not the number of neighbours)
}
neigh[i, ] <- sort(ddist)[2:(k + 1)]
}
return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
Now does that work?
The two first test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
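Conclusion: not too shabby. The values agree but are shifted by one position: the custom function indexes sort(ddist) with 2:(k + 1), which skips the nearest neighbour. That offset is only needed when a point is compared against a set that contains itself; here the test rows are not in the training set, so indexing with 1:k should reproduce the FNN output.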
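For Euclidean distance specifically, the per-pair calls to distance can be avoided in base R. The following is a minimal sketch, not the philentropy or FNN implementation; knn_brute is a hypothetical name and the toy matrices are made up. It returns both the indices (via order) and the distances of the k nearest training rows for each test row.

```r
# Sketch: brute-force Euclidean k-nearest neighbours in base R.
# knn_brute is a hypothetical helper, not part of philentropy or FNN.
knn_brute <- function(train, test, k) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  stopifnot(k <= nrow(train))
  idx <- matrix(0L, nrow(test), k)
  dst <- matrix(0, nrow(test), k)
  for (i in seq_len(nrow(test))) {
    # distance from test row i to every training row at once
    d <- sqrt(colSums((t(train) - test[i, ])^2))
    o <- order(d)[seq_len(k)]   # indices of the k smallest distances
    idx[i, ] <- o
    dst[i, ] <- d[o]
  }
  list(index = idx, dist = dst)
}

# Toy data, made up for illustration
train <- rbind(c(0, 0), c(3, 4), c(1, 0))
test <- rbind(c(0, 0), c(3, 3))
res <- knn_brute(train, test, k = 2)
res$index
res$dist
```

Other metrics would still need a distance function such as the one in philentropy; this only covers the Euclidean case.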

R: backwards principal component calculation

I would like to perform a backwards principal component calculation in R, meaning: obtaining the original matrix by the PCA object itself.
This is an example case:
# Load an expression matrix
load(url("http://www.giorgilab.org/allexp_rsn.rda"))
# Calculate PCA
pca <- prcomp(t(allexp_rsn))
In order to obtain the original matrix, one should multiply the rotations by the PCA themselves, as such:
test<-pca$rotation%*%pca$x
However, as you may check, the calculated "test" matrix is completely different from the original "allexp_rsn" matrix. What am I doing wrong? Is the function prcomp adding something else to the SVD procedure?
Thanks :-)
Using USArrests:
pca <- prcomp(t(USArrests))
out <- t(pca$x%*%t(pca$rotation))
out <- sweep(out, 1, pca$center, '+')
apply(USArrests - out, 2, sum)
Murder Assault UrbanPop Rape
1.070921e-12 -2.778222e-12 3.801404e-13 1.428191e-12
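Remember that PCA is usually run on centered (and often scaled) data. By default prcomp centers each column (center = TRUE) but does not scale it (scale. = FALSE), and pca$x holds the scores (the centered data rotated onto the principal components), not the original data. That is why the center has to be added back after multiplying by the transposed rotation matrix.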
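If scaling is also switched on, both steps have to be undone. A minimal sketch on USArrests (used here without transposing, purely as an illustration), relying on the center and scale components that prcomp stores in its result:

```r
# Sketch: reconstruct the original data from a centered and scaled PCA
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

recon <- pca$x %*% t(pca$rotation)         # back to the centered, scaled data
recon <- sweep(recon, 2, pca$scale, "*")   # undo scaling (per column)
recon <- sweep(recon, 2, pca$center, "+")  # undo centering (per column)

all.equal(recon, as.matrix(USArrests), check.attributes = FALSE)
```

Keeping only the first few columns of pca$x and pca$rotation in the product gives a low-rank approximation instead of the exact matrix.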
Here is a solution using the eigen function, applied to a B/W image matrix to illustrate the point. The function uses increasing numbers of PCs, but you can use all of them, or only some of them
library(gplots)
library(png)
# Download an image:
download.file("http://www.giorgilab.org/pictures/monalisa.tar.gz",destfile="monalisa.tar.gz",cacheOK = FALSE)
untar("monalisa.tar.gz")
# Read image:
img <- readPNG("monalisa.png")
# Dimension
d<-1
# Rotate it:
rotate <- function(x) t(apply(x, 2, rev))
centermat<-rotate(img[,,d])
# Plot it
image(centermat,col=gray(c(0:100)/100))
# Increasing PCA
png("increasingPCA.png",width=2000,height=2000,pointsize=20)
par(mfrow=c(5,5),mar=c(0,0,0,0))
for(end in (1:25)*12){
for(d in 1){
centermat<-rotate(img[,,d])
eig <- eigen(cov(centermat))
n <- 1:end
eigmat<-t(eig$vectors[,n] %*% (t(eig$vectors[,n]) %*% t(centermat)))
image(eigmat,col=gray(c(0:100)/100))
}
}
dev.off()

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by attributing the mean value of a window covering n columns around a given column. I've managed to do it but I'd like to see how would be 'the R way' of doing it as I am making use of for loops. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(ncol=200);
for(i in 1:100){ mat <- rbind(mat,sample(1:200, 200) )}
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat that takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
k.range <- (k.d + 2):(ncol(k) - k.d - 1)
k.smooth <- matrix(nrow=nrow(k))
for( i in k.range){
if (i %% 10 == 0) cat('\r',round(i/length(k.range), 2))
k.smooth <- cbind( k.smooth, rowMeans(k[,c( (i-1-k.d):(i-1) ,i, (i+1):(i + 1 + k.d) )]) ) # window is symmetric around column i
}
return(k.smooth)
}
Now we use smooth_row_mat() with mat:
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix. My real matrices are around 40,000 x 400; it still works, but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
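Two details are worth spelling out in a small sketch: apply over rows returns its result transposed, and filter here means stats::filter (it is masked if dplyr is loaded). The function name smooth_rows, the toy matrix and the window width of 3 are made up for illustration.

```r
# Sketch: centered running mean across each row with stats::filter.
# smooth_rows is a hypothetical helper name.
smooth_rows <- function(m, width) {
  w <- rep(1 / width, width)   # equal weights, so the filter is a plain mean
  # apply() puts each row's result in a column, hence the t()
  t(apply(m, 1, function(r) as.numeric(stats::filter(r, w, sides = 2))))
}

m <- matrix(1:20, nrow = 2, byrow = TRUE)
res <- smooth_rows(m, 3)
res   # same shape as m; the first and last columns are NA for width 3
```

The NA columns at the edges are where the window does not fit; an odd width keeps the window centered on the column being smoothed.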
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean for a neighbourhood of n cells either side of the focal cell. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we say we want each cell to contribute 1/n, so we fill a matrix of n columns, with values 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)

Fill matrix with loop

I am trying to create a matrix n by k with k mvn covariates using a loop.
Quite simple but not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
x [[i]]= mvrnorm(n,mu,sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use ?mvrnorm, from the MASS package. (For more info, see here.)
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R, just generate as many values as you like and add them directly using the data= argument in the matrix() function. If you really were committed to using a loop, you should probably use a double loop, so that you are looping over the columns, and within each loop, looping over the rows. (Note that this is a very inefficient way to code in R--although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
for(i in 1:n){
x[i,j] = rnorm(1, mu, sigma)
}
}
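If the covariates are meant to be correlated, mvrnorm from MASS fills the whole n by k matrix in one call, with mu a vector of length k and Sigma a k by k covariance matrix. A minimal sketch; the 0.6 off-diagonal correlation is an arbitrary value picked for illustration, not something from the question.

```r
library(MASS)

set.seed(20430)
n <- 1000
k <- 5
mu <- rep(0, k)                 # one mean per covariate
Sigma <- matrix(0.6, k, k)      # assumed correlation of 0.6 between covariates
diag(Sigma) <- 1                # unit variances

x <- mvrnorm(n, mu = mu, Sigma = Sigma)
dim(x)                          # n rows, k columns
round(cor(x), 2)                # sample correlations, close to Sigma
```

With Sigma set to the identity matrix this is equivalent in distribution to the rnorm version above.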
Define the matrix first:
E<-matrix(data=0, nrow=10, ncol=10);
Then run two loops, iterating i over rows and j over columns. Mine is an exchangeable correlation structure:
for (i in 1:10)
{
for (j in 1:10)
{
if (i==j) {E[i,j]=1}
else {E[i,j]=0.6}
}
};
A=c(2,3,4,5);# In your case row terms
B=c(3,4,5,6);# In your case column terms
x=matrix(,nrow = length(A), ncol = length(B));
for (i in 1:length(A)){
for (j in 1:length(B)){
x[i,j]<-(A[i]*B[j])# do the similarity function, simi(A[i],B[j])
}
}
x # matrix is filled
I wrote this from the perspective of my own problem.
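Both of these double loops can also be written without loops. A minimal sketch using the same values as above: outer applies a vectorized function to every pair of elements, and diag<- overwrites the diagonal.

```r
# Pairwise products of A and B without a double loop
A <- c(2, 3, 4, 5)
B <- c(3, 4, 5, 6)
x <- outer(A, B)       # x[i, j] is A[i] * B[j]

# Exchangeable correlation structure: 0.6 everywhere, 1 on the diagonal
E <- matrix(0.6, nrow = 10, ncol = 10)
diag(E) <- 1
```

outer expects its function to be vectorized; for a similarity function that only handles scalars, wrapping it with Vectorize is one option.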

How do I repeat a calculation for each row of a matrix in R?

I am very very new to programming and R. I have tried to find an answer to my question, but part of the problem is I don't know exactly what to search.
I am trying to repeat a calculation (statistical distance) for each row of a matrix. Here is what I have so far:
pollution1 <-as.matrix(pollution[,5:6])
ss <- var(pollution1)
ssinv <- solve(ss)
xbar <- colMeans(pollution1)
t(pollution1[1,]-xbar)%*%ssinv%*%(pollution1[1,]-xbar)
This gets me only the first statistical distance, but I don't want to retype this line with a different matrix row to get all of them.
From what I have read, I may need a loop or to use apply(), but haven't had success on my own. Any help with this, and advice on how to search for help so I don't need to post, would be appreciated.
Thank you.
You might also consider the mahalanobis function: from ?mahalanobis,
Returns the squared Mahalanobis distance of all rows in ‘x’ and
the vector mu = ‘center’ with respect to Sigma = ‘cov’. This is
(for vector ‘x’) defined as
D^2 = (x - mu)' Sigma^-1 (x - mu)
Of course, it's good to learn how to use apply too ...
What about just using apply
apply(pollution1, 1, function(i) t(i-xbar) %*% ssinv %*% (i-xbar))
Also, it's helpful if you make your example reproducible, for example:
pollution1 = matrix(rnorm(100), ncol=2)
ss = var(pollution1)
ssinv = solve(ss)
xbar = colMeans(pollution1)
t(pollution1[1,]-xbar) %*% ssinv %*% (pollution1[1,]-xbar)
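As a quick check that mahalanobis and the apply version agree, here is a sketch on the same kind of simulated data (the real pollution data is not available here, so random numbers stand in for it):

```r
set.seed(1)
pollution1 <- matrix(rnorm(100), ncol = 2)   # stand-in for the real data
ss <- var(pollution1)
ssinv <- solve(ss)
xbar <- colMeans(pollution1)

# Built-in: squared statistical distance of every row from xbar
d2 <- mahalanobis(pollution1, center = xbar, cov = ss)

# Same calculation row by row with apply
d2_manual <- apply(pollution1, 1, function(i) t(i - xbar) %*% ssinv %*% (i - xbar))

all.equal(d2, d2_manual)
```

Both return one squared distance per row; take sqrt if you want the distance itself.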
