NMI and Jaccard measures in R

I am trying to compute the Jaccard and NMI indexes in R, using the NMI package and the cluster package. The problem is that I am getting wrong index values: the primary data contains 3 clusters, whereas the two test datasets have 5 and 2 clusters respectively. Is there any way that the NMI and Jaccard indexes can be correctly calculated for data like this? The following is the code I am presently using:
clarax <- clara(a, 3)
clV1 <- clarax$clustering
clara1 <- clara(e, 5)
clV2 <- clara1$clustering
ci <- cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard")
ci[2:3] # Returns Jaccard index and variables used to compute it
ci$intersects
a4 <- ci$Jaccard_Index
sprintf("%.16f", a4)

clarax <- clara(a, 3)
clV1 <- clarax$clustering
clara1 <- clara(f, 2)
clV2 <- clara1$clustering
ci <- cindex(clV1=clV1, clV2=clV2, self=TRUE, minSZ=1, method="jaccard")
ci[2:3] # Returns Jaccard index and variables used to compute it
ci$intersects
a5 <- ci$Jaccard_Index
sprintf("%.16f", a5)
And for the NMI:
b4<-NMI(groundtruth,clustering_4)
sprintf("%.16f", b4)
Here are the links to the datasets:
http://www.mediafire.com/file/brar5msd12j552z/clustering_4.txt
http://www.mediafire.com/file/xd12xkd0h4jx6my/clustering_5.txt
http://www.mediafire.com/file/bkwat6edkcyp4wj/partitions.txt
The partitions dataset is the primary (ground-truth) dataset, and the indexes have to be computed against these values.
Please help me out.
Regards

I don't really see the problem here. NMI doesn't require the two clusterings to have the same number of clusters. Using the NMI package (since it isn't entirely clear from your question which package you used), it's as simple as this:
library(NMI)
NMI(partitions, clustering4)
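For completeness, here is a minimal sketch of the whole round trip, assuming each of the linked files is a plain two-column text table (item id, cluster label) that read.table can parse; the file names are taken from the links above and the format is an assumption, not verified against the actual files:
library(NMI)

# Assumed format: two whitespace-separated columns, item id and cluster label
groundtruth  <- read.table("partitions.txt")
clustering_4 <- read.table("clustering_4.txt")

b4 <- NMI(groundtruth, clustering_4)
b4  # NMI is defined even when the two partitions have different numbers of clusters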

Related

Using popbio package to calculate a population projection correctly?

So I have been working through a population ecology exercise using the popbio package in RStudio that focuses on Leslie matrices. I have successfully created a Leslie matrix with the proper dimensions using the fecundity (mx) and annual survival (sx) values that I calculated from my life table. I am now trying to use the pop.projection function in the popbio package to multiply my Leslie matrix (les.mat) by a starting population vector (N0) over a number of time intervals (4 years). It is my understanding that you should be able to take a Leslie matrix and multiply it by a population vector to calculate the population size after a set number of time intervals. Have I done something wrong here? When I try to run my pop.projection line of code I get the following error message in R:
"> projA <- pop.projection(les.mat,N0,10)
Error in A %*% n : non-conformable arguments"
Could the problem be an issue with my pop.projection call? I am thinking it may be an issue with my N0 argument (the population vector): when I look at my N0 values it seems like it has been saved in R as a numeric vector. Should I be converting it into its own matrix, or into its own vector somehow, to get my pop.projection line of code to run? Any advice would be greatly appreciated; the short code I have been using is below.
Sx <- c(0.8,0.8,0.7969,0.6078,0.3226,0)
mx <- c(0,0,0.6,1.09,0.2,0)
Fx <- mx # fecundity values
S <- Sx # dropping the first value
F <- Fx
les.mat <- matrix(rep(0,36),nrow=6)
les.mat[1,] <- F
les.mat
for(i in 1:5){
  les.mat[(i+1),i] <- S[i]
}
les.mat
N0 <- c(100,80,64,51,31,10,0)
projA <- pop.projection(les.mat,N0,10)
The function uses matrix multiplication on its first and second arguments, so their dimensions must conform. The les.mat matrix is 6x6, but N0 has length 7. Try
projA <- pop.projection(les.mat, N0[-7], 10) # Delete last value
or
projA <- pop.projection(les.mat, N0[-1], 10) # Delete first value
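For reference, a minimal sketch of what a run looks like once the dimensions conform, using the matrix from the question; dropping the last element of N0 is just one of the two options above, not a recommendation about which stage to discard:
library(popbio)

N0 <- c(100, 80, 64, 51, 31, 10)          # 6 stages, matching the 6x6 Leslie matrix
projA <- pop.projection(les.mat, N0, 10)  # project 10 time steps

projA$lambda      # estimated asymptotic growth rate
projA$pop.sizes   # total population size at each time step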

Function to calculate Euclidean distance in R

I am trying to implement a KNN classifier in R from scratch on the iris data set, and as part of this I have written a function to calculate the Euclidean distance. Here is my code.
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16, c("Sepal.Length", "Petal.Length")]

# euclidean distance
euclidean_dist <- function(k, unk) {
  distance <- 0
  for(i in 1:nrow(k))
    distance[i] <- sqrt((k[,1][i] - unk[,1][i])^2 + (k[,2][i] - unk[,2][i])^2)
  return(distance)
}

euclidean_dist(known_data, unknown_data)
However, when I call the function it returns the first value correctly and the rest as NA.
Could anyone show me where I have gone wrong in the code?
Thanks in advance.
The aim is to calculate the distance between the ith row of known_data, and the single unknown_data point.
How to fix your code
When you calculate distance[i], you're trying to access the ith row of the unknown data point, which doesn't exist and is hence NA. I believe your code will run fine if you make the following edits:
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16, c("Sepal.Length", "Petal.Length")]

# euclidean distance
euclidean_dist <- function(k, unk) {
  # Make distance a vector [although not technically required]
  distance <- rep(0, nrow(k))
  for(i in 1:nrow(k))
    # Change unk[,1][i] to unk[1,1] and similarly for unk[,2][i]
    distance[i] <- sqrt((k[,1][i] - unk[1,1])^2 + (k[,2][i] - unk[1,2])^2)
  return(distance)
}

euclidean_dist(known_data, unknown_data)
One final note: in the version of R I'm using, the iris dataset has a Species column rather than a Class column.
An alternative method
As suggested by #Roman Luštrik, the entire aim of getting the Euclidean distances can be achieved with a simple one-liner:
sqrt((known_data[, 1] - unknown_data[, 1])^2 + (known_data[, 2] - unknown_data[, 2])^2)
This is very similar to the function you wrote, but does it in vectorised form, rather than through a loop, which is often a preferable way of doing things in R.
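To go from these distances to an actual KNN prediction (the stated goal of the question), one possible sketch is below. It assumes known_data was built with the Species column (see the note above) and uses k = 5; neither choice comes from the original answers:
d <- sqrt((known_data[, 1] - unknown_data[, 1])^2 +
          (known_data[, 2] - unknown_data[, 2])^2)

k <- 5
nearest <- order(d)[1:k]                       # indices of the k closest known points
votes   <- table(known_data$Species[nearest])  # count class labels among the neighbours
names(votes)[which.max(votes)]                 # majority-vote prediction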
The best and fastest way is to use the h2o package:
# load library
library(h2o)
# initialize the node
h2o.init()
# transform the data frames to h2o frames
known_data <- as.h2o(known_data)
unknown_data <- as.h2o(unknown_data)
# create a frame in which the distances are going to be recorded
matrix1 <- h2o.createFrame(rows = nrow(known_data), cols = nrow(unknown_data))
# loop to calculate the distance between all the rows of both frames
for(i in 1:nrow(unknown_data)){
  matrix1[, i] <- as.data.frame(h2o.distance(known_data, unknown_data[i, ], "l2"))
}

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors, each with about 80 entries. I'm trying to find significant correlations; that is, I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, i.e. Y = a*X + b.
The lm() command then gives me an estimate of a and a p-value for that estimate. I would then go on and use another of my variables for X and try again, until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and then maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize the correlations, you can use the package corrgram:
library(corrgram)
corrgram(the_data)
which will produce a correlogram showing all pairwise correlations.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import the library and find the correlations (as in #Carlos's answer):
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
  for (j in 1:m){
    if (!is.na(correlations$P[i,j])){
      if (correlations$P[i,j] < 0.05) {
        print(paste(rownames(correlations$P)[i], "-", colnames(correlations$P)[j], ": ", correlations$P[i,j]))
      }
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for some exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that make adjustments to compensate for running multiple tests, e.g. the Bonferroni correction: https://en.wikipedia.org/wiki/Bonferroni_correction.
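As an illustration of such an adjustment (not part of the original answer), base R's p.adjust can be applied to the p-values returned by rcorr before deciding what counts as significant:
# Take the p-values of each distinct pair (upper triangle of the symmetric matrix)
p_raw <- correlations$P[upper.tri(correlations$P)]

# Bonferroni correction; method = "BH" would control the false discovery rate instead
p_adj <- p.adjust(p_raw, method = "bonferroni")

sum(p_adj < 0.05, na.rm = TRUE)  # number of pairs still significant after adjustment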
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x)
  {
    cor(the_data[, 1], x)
  },
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this, just use the command which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator.
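A minimal sketch of that route, assuming the glmnet package and the the_data demo data frame used in the other answers (an illustration, not part of the original answer):
library(glmnet)

x <- as.matrix(the_data[, -1])   # predictor matrix
y <- the_data$y                  # response

fit <- cv.glmnet(x, y)           # lasso with cross-validated penalty
coef(fit, s = "lambda.min")      # non-zero coefficients = selected variables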
One option is to run a correlation matrix:
cor_result <- cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numeric features) doesn't fit.
I'm looking for a kNN implementation for R where it is possible to select different distance methods, such as the Hamming distance.
Is there a way to use common kNN implementations, like the one in {class}, with different distance metric functions?
I'm using R 2.15.
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like), you can easily perform kNN classification without the need for any special package.
# Generate dummy data
y <- rep(1:2, each=50)                           # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5   # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)

# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
# apply() returns the orderings as columns, so transpose to get one row per test point
NN <- t(apply(d[test.set, design.set], 1, order))

# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})

# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as a matrix in the case k=1. Otherwise it will be converted to a vector and apply will throw an error.
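One possible cleaner alternative for that line (a suggestion, not from the original answer): when the class labels are small positive integers, as here, tabulate avoids the table/names round trip:
# Majority vote: which.max(tabulate(...)) returns the most frequent label directly
majority_vote <- function(labels) which.max(tabulate(labels))
majority_vote(c(1, 2, 2, 1, 2))  # 2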

applying the pvclust R function to a precomputed dist object

I'm using R to perform a hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it in a dist object
I run hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?
I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d, 1, eig=TRUE)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
     type="h", lwd=5, las=1,
     xlab="Number of dimensions",
     ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d, nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist(matrix(rnorm(k*n), ncol=k), method="manhattan")
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or whether you computed it yourself from raw data. In the former case, as already suggested by #Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints in another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
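If I recall correctly, more recent releases of pvclust (2.0 and later) also let you pass a custom function as method.dist. A heavily hedged sketch, assuming the function receives the data matrix (with the objects to cluster in the columns) and must return a dist object, and where data_matrix is a hypothetical placeholder for your raw data:
library(pvclust)

# Hypothetical custom distance: Manhattan distance between columns
# (assumes method.dist accepts a function returning a "dist" object, as in pvclust >= 2.0)
my.dist <- function(x) {
  res <- dist(t(x), method = "manhattan")
  attr(res, "method") <- "manhattan"
  res
}

result <- pvclust(data_matrix, method.dist = my.dist, method.hclust = "ward.D2", nboot = 100)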
