Applying the pvclust R function to a precomputed dist object

I'm using R to perform hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it into a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot, because pvclust does not accept a precomputed dist object. How can I proceed, considering that I'm using a distance that is not among those provided by R's dist function?

I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the number of dimensions to keep (drop dimensions whose eigenvalues are close to 0)
plot(x$eig,
type="h", lwd=5, las=1,
xlab="Number of dimensions",
ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d, nb_dimensions)  # nb_dimensions = the number of dimensions chosen from the plot above
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
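As a quick sanity check (not part of the original answer), you can verify how well the recovered coordinates reproduce the original dissimilarities before handing them to pvclust:
# Should be close to 0 if nb_dimensions was chosen well
max(abs(dist(x) - d))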

If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), nc=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(t(x)) # pvclust clusters columns, so transpose to cluster the points
If the dataset is large, you may have to check how pvclust is implemented.

It's not clear to me whether you only have the distance matrix, or whether you computed it yourself from the raw data. In the first case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on a related question on CrossValidated). In the second case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
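For what it's worth, recent CRAN releases of pvclust (2.0 and later) are documented to accept a function for the method.dist argument, so if you do have the raw data you may not need the unofficial version any more. A minimal sketch, assuming that API and using a Manhattan distance as a stand-in custom measure; check ?pvclust in your installed version before relying on it:
library(pvclust)
# Custom distance: the function receives the data matrix and must return a
# "dist" object over its columns (pvclust clusters columns)
manhattan_cols <- function(x) dist(t(x), method = "manhattan")
set.seed(1)
dat <- matrix(rnorm(200), nrow = 20)  # toy data: 20 rows, 10 columns
res <- pvclust(dat, method.dist = manhattan_cols, nboot = 100)  # small nboot just for illustration
plot(res)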

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, getting only 1 value back (so it's computing the distance over the entire vectors), or ending up with a matrix, when all I need is a 4th column with the distance for each pair (columns 'P' and 'S'). I'm a bit confused by matrices, so I'm not sure how to work with that result.
I tried making a function and applying it to the two columns, but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
This returns a matrix:
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused about how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe; matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pairs of points) but for R instead of Python.
Thanks for the help!
Try this out if you would just like to add another column to your dataframe:
testdf$distance <- sqrt(testdf$P^2 + testdf$S^2)
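For reference, the row-wise apply() version the question attempted could be written as follows (a sketch computing the same sqrt(P^2 + S^2) quantity; the vectorized one-liner above is simpler and faster):
# Row-wise version of the same calculation with apply()
testdf$distance <- apply(testdf[, c("P", "S")], 1, function(row) sqrt(sum(row^2)))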

Principal component analysis using R. Automatic and manual results do not match

Principal component analysis was carried out in two different ways on the following data (ch082.dat), using the R code in Box 1 below: https://drive.google.com/file/d/1xykl6ln-bUnXIs-jIA3n5S3XgHjQbkWB/view?usp=sharing
The first method uses the rotation matrix (see 'ans_mat' under '#rotated data' in the Box 1 code), and
the second method uses the 'prcomp' function (see 'rpca' under '#rotated data' in the Box 1 code).
However, there is a subtle discrepancy between the results of the rotation-matrix method and the prcomp method.
My Question
What should I do so that the result of the rotation-matrix-based method matches the result of the prcomp function?
As far as I have tried with various data sets, the discrepancies seem to be limited to scale shifts and mirror reflections.
The results of the rotation-matrix-based method are shown in the left panel.
The results of the prcomp-based method are shown in the right panel.
Mirror inversion can be seen in the "ch082.dat" data (see Fig. 1);
it seems that, for some j, the sign of the jth eigenvector of the correlation matrix and the sign of the jth column of the prcomp output may be reversed. If some eigenvalues are (nearly) equal, the difference may be more complex than a mirror inversion.
Fig.1
There is a scale shift for the Box 2 data (see Fig. 2), despite centering and normalizing the data.
Fig.2
Box 1
#dataload
##Use the 'setwd' function to specify the directory containing 'ch082.dat'.
##For example, if you put this file directly under the C drive of your Windows PC, you can run the following command.
setwd("C:/") #Depending on where you put the file, you may need to change the path.
getwd()
w1<-read.table("ch082.dat",header = TRUE,row.names = 1,fileEncoding = "UTF-8")
w1
#Function for standardizing data
#Thanks to https://qiita.com/ohisama2/items/5922fac0c8a6c21fcbf8
standalize <- function(data) {
  for (i in length(data[1, ])) {
    x <- as.matrix(data[, i])
    y <- (x - mean(x) / sd(x))
    data[, i] <- y
  }
  return(data)
}
#Method using rotation matrix
z_=standalize(w1)
B_mat=cor(z_) #Compute correlation matrix
eigen_m <- eigen(B_mat)
sample_mat <- as.matrix(z_)
ans_mat=sample_mat
for(j in 1:length(sample_mat[1,])){
ans_mat[,j]=sample_mat%*%eigen_m$vectors[,j]
}
#Method using the prcomp function (result stored in rpca)
rpca <- prcomp(w1,center=TRUE, scale=TRUE)
#eigen vectors
eigen_m$vectors
rpca
#rotated data
ans_mat
rpca$x
#Graph Plots
par(mfrow=c(1,2))
plot(
ans_mat[,1],
ans_mat[,2],
main="Rotation using eigenvectors"
)
plot(rpca$x[,1], rpca$x[,2],
main="Principal component score")
par(mfrow=c(1,1))
#summary
summary(rpca)$importance
Box 2
sample_data <- data.frame(
X = c(2,4, 6, 5,7, 8,10),
Y = c(6,8,10,11,9,12,14)
)
X = c(2,4, 6, 5,7, 8,10)
Y = c(6,8,10,11,9,12,14)
plot(Y ~ X)
w1=sample_data
Reference
https://logics-of-blue.com/principal-components-analysis/
(Written in Japanese)
The two sets of results agree. First we can simplify your code a bit. You don't need your function or the for loop:
z_ <- scale(w1)
B_mat <- cor(z_)
eigen_m <- eigen(B_mat)
ans_mat <- z_ %*% eigen_m$vectors
Now the prcomp version
z_pca <- prcomp(z_)
z_pca$sdev^2 # Equals eigen_m$values
z_pca$rotation # Equals eigen_m$vectors
z_pca$x # Equals ans_mat
Your original code mislabeled ans_mat columns. They are actually the principal component scores. You can fix that with
colnames(ans_mat) <- colnames(z_pca$x)
The PC loadings (and therefore the scores) are not uniquely defined with respect to reflection. In other words, multiplying all of the loadings or scores in one component by -1 flips them but does not change their relationships to one another. Multiply z_pca$x[, 1] by -1 and the plots will match:
z_pca$x[, 1] <- z_pca$x[, 1] * -1
dev.new(width=10, height=6)
par(mfrow=c(1,2))
plot(ans_mat[,1], ans_mat[,2], main="Rotation using eigenvectors")
plot(z_pca$x[,1], z_pca$x[,2], main="Principal component score")
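If you would rather not flip columns by hand, one option (not part of the original answer, and assuming the eigenvalues are distinct so the two sets of loadings differ only by per-column signs) is to fix a sign convention for both results, e.g. make the largest-magnitude entry of each loading vector positive:
# Sketch: align the sign of each component; start from a fresh fit,
# before the manual flip above.
z_pca2 <- prcomp(z_)
align_signs <- function(scores, loadings) {
  s <- apply(loadings, 2, function(v) sign(v[which.max(abs(v))]))
  sweep(scores, 2, s, "*")
}
a1 <- align_signs(ans_mat, eigen_m$vectors)
a2 <- align_signs(z_pca2$x, z_pca2$rotation)
all.equal(unname(a1), unname(a2))  # should be TRUE up to numerical precision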

R - Different approach to speed up 3 dimension array/matrix creation

My question is one of approach. Using SO, I iterated through methods to create a three-dimensional array in R (this is my first question; R is a constraint). The use case is that this final array needs to be updated often, while the two input arrays are updated at different periods. The goal is to minimize the final array creation time, and the intermediate steps as well if possible.
I know I could reach for Rcpp, and I assign more than I need to for readability, but what I am wondering is:
Is there a better approach to completing this operation?
if (!require("geosphere")) install.packages("geosphere")
#simulate real data
dimLength <- 418
latLong <- cbind(rep(40,418),rep(2,418))
potentialChurn <- as.matrix(rep(500,418))
#create 2D matrix
valueMat <- matrix(0,dimLength,dimLength)
value <- potentialChurn
valueTranspose <- t(value)
for (s in 1:dimLength){valueMat[s,] <- value + valueTranspose[s]}
diag(valueMat) <- 0
#create 3D matrix from copying 2D matrix
bigValMat <- array(0,dim=c(dimLength,dimLength,dimLength))
for (d in 1:dimLength){bigValMat[,d,] <- valueMat}
#get crow fly distance between locations, create 2D matrix
distMat <- as.matrix(outer(seq(dimLength), seq(dimLength), Vectorize(function(i, j) distCosine(latLong[i,], latLong [j,]))))
###create 3D matrix by calculating distance between any two locations;
# create 2D matrix from each column in original 2D matrix
# add this column-replicated 2D matrix to the original
bigDistMat <- array(0,dim=c(dimLength,dimLength,dimLength))
for (p in 1:dimLength){
addCol <- distMat[,p]
addMatrix <- as.matrix(addCol)
for (y in 2:dimLength) {addMatrix <- cbind(addMatrix,addCol)}
bigDistMat[,p,] <- data.matrix(distMat) + data.matrix(addMatrix)}
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
...As context, this is part of a two-step-ahead forecast policy developed for a class using Barcelona bike-sharing (Bicing) data. The project is over and I am interested in how I could have done better.
In general, if you want to speed up your code, you want to identify the bottlenecks and fix them, as explained here. Putting all your code in a function beforehand would be a good idea.
In your specific case, you use far too many for loops for R code. You need to vectorize your code much more.
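A minimal profiling sketch with base R, assuming the original code has been wrapped in a function (the name build_arrays below is just for illustration):
# Profile where the time is actually spent
Rprof("profile.out")
result <- build_arrays()  # hypothetical wrapper around the original code
Rprof(NULL)
summaryRprof("profile.out")$by.self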
Edit
Now for the long answer:
#simulate real data, you want them to be random
dimLength <- 418
latLong <- cbind(rnorm(dimLength,40,0.5),rnorm(dimLength,2,0.5))
potentialChurn <- as.matrix(rnorm(dimLength,500,10))
#create 2D matrix; outer is designed for this operation
value <- potentialChurn
valueMat <- outer(value, t(value), FUN="+")[,1,1,]
diag(valueMat) <- 0
# create 3D matrix from copying 2D matrix, again, avoid for loop
bigValMat <- array(rep(valueMat,dimLength),dim=c(dimLength,dimLength,dimLength))
# and use aperm to permute the dimensions
bigValMat <- aperm(bigValMat, c(1,3,2))
#get crow fly distance between locations, create 2D matrix
# other packages are available to compute that kind of distance matrix
# but let's stay in plain R
# wordy but so much faster (and easier to read)
longs1 <- rep(latLong[,1],dimLength)
lats1 <- rep(latLong[,2],dimLength)
latLong1 <- cbind(longs1,lats1)
longs2 <- rep(latLong[,1],each=dimLength)
lats2 <- rep(latLong[,2],each=dimLength)
latLong2 <- cbind(longs2,lats2)
distMat <- matrix(distCosine(latLong1,latLong2),ncol=dimLength)
###create 3D matrix by calculating distance between any two locations;
# same logic than for bigValMat
addMatrix <- array(rep(distMat,dimLength),dim=rep(dimLength,3))
distMat3D <- aperm(addMatrix,c(1,3,2))
bigDistMat <- addMatrix + distMat3D
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
Here it is about 25x faster than your initial code (76 s -> 3 s). It could still be improved further, but you get the idea: avoid for loops, cbind and the like wherever you can.
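As a sanity check (not part of the original answer), the vectorized distance matrix can be compared against the element-wise outer() version on a small case before trusting the refactor:
# Both constructions should give the same crow-fly distance matrix
library(geosphere)
n <- 10
ll <- cbind(rnorm(n, 40, 0.5), rnorm(n, 2, 0.5))
slow <- outer(seq(n), seq(n), Vectorize(function(i, j) distCosine(ll[i, ], ll[j, ])))
fast <- matrix(distCosine(ll[rep(1:n, n), ], ll[rep(1:n, each = n), ]), ncol = n)
stopifnot(isTRUE(all.equal(slow, fast)))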

R: backwards principal component calculation

I would like to perform a backwards principal component calculation in R, meaning: obtaining the original matrix from the PCA object itself.
This is an example case:
# Load an expression matrix
load(url("http://www.giorgilab.org/allexp_rsn.rda"))
# Calculate PCA
pca <- prcomp(t(allexp_rsn))
In order to obtain the original matrix, I thought one should multiply the rotation matrix by the principal component scores, like this:
test<-pca$rotation%*%pca$x
However, as you may check, the calculated "test" matrix is completely different from the original "allexp_rsn" matrix. What am I doing wrong? Is the function prcomp adding something else to the SVD procedure?
Thanks :-)
Using USArrests:
pca <- prcomp(t(USArrests))
out <- t(pca$x%*%t(pca$rotation))
out <- sweep(out, 1, pca$center, '+')
apply(USArrests - out, 2, sum)
Murder Assault UrbanPop Rape
1.070921e-12 -2.778222e-12 3.801404e-13 1.428191e-12
Remember that prcomp() centers the data by default but scales it only if you set scale. = TRUE, and that pca$x contains the rotated (centered) data rather than the original values. That is why the centering has to be added back (and any scaling undone) when reconstructing the original matrix.
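A minimal sketch of the full round trip, using the USArrests data from above and assuming the PCA was fitted with center = TRUE and scale. = TRUE:
# Reconstruct the original data from a prcomp object
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
rec <- pca$x %*% t(pca$rotation)      # back-rotate the scores
rec <- sweep(rec, 2, pca$scale, "*")  # undo the scaling
rec <- sweep(rec, 2, pca$center, "+") # undo the centering
all.equal(as.matrix(USArrests), rec, check.attributes = FALSE)  # TRUE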
Here is a solution using the eigen function, applied to a black-and-white image matrix to illustrate the point. The loop below uses an increasing number of PCs, but you can use all of them or only some of them:
library(gplots)
library(png)
# Download an image:
download.file("http://www.giorgilab.org/pictures/monalisa.tar.gz",destfile="monalisa.tar.gz",cacheOK = FALSE)
untar("monalisa.tar.gz")
# Read image:
img <- readPNG("monalisa.png")
# Dimension
d<-1
# Rotate it:
rotate <- function(x) t(apply(x, 2, rev))
centermat<-rotate(img[,,d])
# Plot it
image(centermat,col=gray(c(0:100)/100))
# Increasing PCA
png("increasingPCA.png",width=2000,height=2000,pointsize=20)
par(mfrow=c(5,5),mar=c(0,0,0,0))
for(end in (1:25)*12){
for(d in 1){
centermat<-rotate(img[,,d])
eig <- eigen(cov(centermat))
n <- 1:end
eigmat<-t(eig$vectors[,n] %*% (t(eig$vectors[,n]) %*% t(centermat)))
image(eigmat,col=gray(c(0:100)/100))
}
}
dev.off()
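As a quick follow-up (not part of the original answer), the eigenvalues computed above also tell you how much variance each truncation level retains, which is why the reconstruction sharpens as more components are added:
# Cumulative share of variance captured by the first n components
# (uses img and rotate() from the block above)
eig <- eigen(cov(rotate(img[,,1])))
cumsum(eig$values)[c(12, 60, 120, 300)] / sum(eig$values)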

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) is not a good fit.
I'm looking for a kNN implementation for R where it is possible to select different distance measures, such as Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like), you can easily perform kNN classification without the need for any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- apply(d[test.set, design.set], 1, order)  # column i = design-set indices ordered by distance to test point i
# Predict class membership of the test set
k <- 5
pred <- apply(NN[1:k, , drop=FALSE], 2, function(nn){
    tab <- table(y[design.set][nn])
    as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to keep the subset of NN as a matrix in the case k=1. If it is omitted, the subset is converted to a vector and apply will throw an error.
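Regarding the "dirty line": one common, slightly cleaner way to get the most frequent value of a vector (a sketch, not part of the original answer) is:
# Mode of a vector via tabulate/match
most_common <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
most_common(c(1, 2, 2, 3))  # 2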
