Comparing 2 Cluster solutions using k-means clustering - r

I am experimenting with clustering in R for the first time. Following the basic R help online, I tried to compare the outcome of 2 cluster solutions.
I copied and pasted the script, taking care to name the relevant data sets correctly first, but I keep getting an error message that I don't understand.
Any ideas?
The script is simply:
# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
and the error message I am getting is:
> library(fpc)
> cluster.stats(d, fit1$cluster, fit2$cluster)
Error in as.matrix.dist(d) :
length of 'dimnames' [1] not equal to array extent
In addition: Warning messages:
1: In as.dist.default(d) : NAs introduced by coercion
2: In as.dist.default(d) : non-square matrix
3: In as.matrix.dist(d) :
number of items to replace is not a multiple of replacement length
Thanks

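The d object should contain a matrix of distances (usually a symmetric matrix with zeroes on the diagonal), not the raw data or a clustering result. The warnings "NAs introduced by coercion" and "non-square matrix" suggest that your d is something else. In R you can obtain the distance matrix by calling dist() on the numeric data that you clustered:
d <- dist(mydata)  # mydata: the numeric data that was passed to kmeans()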
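As a minimal sketch of the whole workflow, assuming the two solutions come from kmeans() on the same numeric data: the built-in iris measurements stand in for the asker's data, and the names mydata, fit1 and fit2 are illustrative. The fpc call is guarded so the sketch still runs if the package is not installed.

```r
# Illustrative data: four numeric columns of iris, standardized
mydata <- scale(iris[, 1:4])

set.seed(1)
fit1 <- kmeans(mydata, centers = 3, nstart = 10)  # first cluster solution
fit2 <- kmeans(mydata, centers = 4, nstart = 10)  # second cluster solution

# Distance object computed from the data, not from the clustering result
d <- dist(mydata)

if (requireNamespace("fpc", quietly = TRUE)) {
  stats <- fpc::cluster.stats(d, fit1$cluster, fit2$cluster)
  # Agreement between the two solutions
  print(stats$corrected.rand)  # corrected Rand index
  print(stats$vi)              # variation of information
}
```

The key point is that d is built from the same rows that were clustered, so its size matches the length of fit1$cluster and fit2$cluster.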

Related

K-Means clustering in R error NA/NaN/Inf in foreign function call

I have a dataset that I have created in R. It is structured as follows:
I am trying to cluster the observations using k-means. However, I get the following error message:
> cl <- kmeans(sample, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it?
The documentation of kmeans (type ?kmeans in the console to see it) stipulates that the argument x has to be:
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
Here, the first row is what prevents the data from being used by kmeans. I believe your first row is supposed to be your column names.
Moreover, you can't cluster on your second column, genre, because it is character, and I assume the first column should not be used either. Am I right?
So, if your dataset is called samples, try:
colnames(samples) <- samples[1,]
samples_cluster <- samples[-1,3:ncol(samples)]
cl <- kmeans(samples_cluster,3)
Does that answer your question?
If not, can you provide a reproducible example of your dataset so that we can check the data frame passed to kmeans? To do this, please see: How to make a great R reproducible example
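Since the original data are not shown, here is a hedged sketch of the fix on a made-up data frame. The column names id, genre, x and y and all values are assumptions for illustration only. It also converts the remaining columns to numeric, because a header row sitting inside the data usually means every column was read as text.

```r
# Hypothetical data frame whose first row holds the real column names
samples <- data.frame(
  V1 = c("id", "a", "b", "c", "d", "e", "f"),
  V2 = c("genre", "rock", "rock", "pop", "pop", "jazz", "jazz"),
  V3 = c("x", "1.0", "1.2", "0.9", "5.1", "5.3", "4.8"),
  V4 = c("y", "2.0", "2.1", "1.9", "8.0", "8.2", "7.9"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop it and the two text columns
colnames(samples) <- as.character(unlist(samples[1, ]))
samples_cluster <- samples[-1, 3:ncol(samples)]

# The remaining columns are still character; convert them to numeric
samples_cluster <- as.data.frame(lapply(samples_cluster, as.numeric))
stopifnot(!anyNA(samples_cluster))  # fails early if some text could not be converted

set.seed(1)
cl <- kmeans(samples_cluster, centers = 2)
print(cl$cluster)
```

If the stopifnot() check fails, some entries are not valid numbers and need to be cleaned before clustering.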

Finding correlation with p-value in R using Hmisc package

I am trying to find the correlation between the variables of a data frame in R. The head of the data frame is below.
> head(datafile)
Taxon Petals Internode Sepal Bract Petiole Leaf Fruit
1 I 5.621498 29.48060 2.462107 18.20341 11.27910 1.128033 7.876151
2 I 4.994617 28.36025 2.429321 17.65205 11.04084 1.197617 7.025416
3 I 4.767505 27.25432 2.570497 19.40838 10.49072 1.003808 7.817479
4 I 6.299446 25.92424 2.066051 18.37915 11.80182 1.614052 7.672492
5 I 6.489375 25.21131 2.901583 17.31305 10.12159 1.813333 7.758443
6 I 5.785868 25.52433 2.655643 17.07216 10.55816 1.955524 7.880880
The code I'm using to find the correlation is below:
#correlation
install.packages("Hmisc")
library(Hmisc)
rcorr(as.matrix(datafile))
I get the following error when I try that.
> rcorr(as.matrix(datafile))
Error in rcorr(as.matrix(datafile)) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
Kindly help.
Yes, just as the comment said: I had to exclude the variable Taxon, which is not numeric.
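A minimal sketch of that fix, using the six rows and three of the columns shown in the question (the other columns are left out only for brevity). Selecting columns with is.numeric is one possible way to drop Taxon, not necessarily what the asker did. The Hmisc call is guarded in case the package is not installed.

```r
# First six rows from the question, restricted to three measurement columns
datafile <- data.frame(
  Taxon     = rep("I", 6),
  Petals    = c(5.621498, 4.994617, 4.767505, 6.299446, 6.489375, 5.785868),
  Internode = c(29.48060, 28.36025, 27.25432, 25.92424, 25.21131, 25.52433),
  Sepal     = c(2.462107, 2.429321, 2.570497, 2.066051, 2.901583, 2.655643),
  stringsAsFactors = FALSE
)

# Keep only numeric columns, so as.matrix() yields a numeric (not character) matrix
m <- as.matrix(datafile[, sapply(datafile, is.numeric)])

# Base R correlation matrix, available without extra packages
r <- cor(m)
print(r)

# Correlations with p-values from Hmisc, if the package is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  res <- Hmisc::rcorr(m)
  print(res$r)  # correlation coefficients
  print(res$P)  # p-values
}
```

With Taxon left in, as.matrix() produces a character matrix, and that is what triggers the "NAs introduced by coercion" warning.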

Error in huge R package when criterion "stars"

I am trying to build an association network from some expression data. The data set is large: about 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but there could also be an error caused by a later step that encountered a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if a) there were infinities in the calculations or b) one of the columns were constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate columns that are entirely constant you might do something like this (assuming it is a data frame):
which(sapply(lapply(data, unique), length) == 1)
If it's a matrix, you need to use apply.
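For the matrix case, a rough sketch of both checks could look like this. The small random matrix and the column names g1 to g5 are made up for illustration; the asker's real object has 317 rows and 32291 columns.

```r
# Small illustrative matrix with one constant column and one missing value
set.seed(42)
data <- matrix(rnorm(50), nrow = 10, ncol = 5,
               dimnames = list(NULL, paste0("g", 1:5)))
data[, 3] <- 2     # constant column: its standard deviation is zero
data[4, 5] <- NA   # a single missing value

# Missing values per column
na_per_col <- colSums(is.na(data))
print(na_per_col[na_per_col > 0])

# Number of distinct non-missing values per column; <= 1 means constant
n_unique <- apply(data, 2, function(x) length(unique(x[!is.na(x)])))
constant_cols <- which(n_unique <= 1)
print(constant_cols)

# Candidate input with the constant columns removed
data_ok <- data[, n_unique > 1, drop = FALSE]
```

Dropping constant columns before calling huge.npn() is only one option; whether such genes should be removed is a modelling decision.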

R skmeans package - where does this error come from: "missing value where TRUE/FALSE needed"

I tried to cluster my data by following the manual page of the skmeans package.
I started by installing all required packages.
I then imported my data table, and made a matrix out of it with:
x <- as.matrix(x)
# See dimensions
dim(x)
[1] 184 4000
When I try to hard partition my data into 5 clusters - as it is done in the manual's first example - like so:
hparty <- skmeans(x, 5, control = list(verbose = TRUE))
I receive the following error message:
Error in if (!all(row_norms(x) > 0)) stop("Zero rows are not allowed.") :
missing value where TRUE/FALSE needed
And when I just type:
test <- skmeans(x, 5)
I get:
Error in skmeans(x, 5) : Zero rows are not allowed.
I'm trying to figure out where this error is coming from, and why the function can't get a TRUE/FALSE value. Has anyone ever experienced this problem?
Thank you in advance!
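Spherical k-means is k-means where every vector is normalized to length 1.
If you have an all-zero vector, this is not possible, and you cannot use spherical k-means (or cosine similarity).
!all(row_norms(x) > 0)
is the test that you do not have a row of length 0. The "missing value where TRUE/FALSE needed" message means that this condition evaluated to NA, which happens when a row norm is itself NA, for example because x contains missing values.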
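To find the offending rows before calling skmeans, something along these lines may help. It recomputes Euclidean row norms in base R rather than using the package's internal row_norms helper, and the tiny matrix is purely illustrative.

```r
# Illustrative matrix: a normal row, an all-zero row, and a row with a missing value
x <- matrix(c(1, 2,
              0, 0,
              3, NA), nrow = 3, byrow = TRUE)

# Euclidean norm of each row; NA if the row contains a missing value
row_norms <- sqrt(rowSums(x^2))

zero_rows <- which(!is.na(row_norms) & row_norms == 0)  # cannot be normalized
na_rows   <- which(is.na(row_norms))                    # make the if() condition NA

print(zero_rows)
print(na_rows)

# Keep only rows that can be scaled to unit length
x_clean <- x[!is.na(row_norms) & row_norms > 0, , drop = FALSE]
```

Whether to drop such rows or repair them (for example by imputing missing values) depends on what the rows represent.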

Error with knn function

I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably raised by 'as.double(train)' and 'as.double(test)'. Since you did not provide exact details of what 'mydades' looks like, here are some of my best guesses (using artificial multivariate normal data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The two Inf assignments above break knn but do not trigger the 'NAs introduced by coercion' warning
# This breaks knn and gives the same error as yours; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and the class labels in a single matrix. Perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Calls such as
str(mydades); summary(mydades)
may also help you/us locate the problematic data entries, so that they can be corrected to numeric entries or the non-numeric fields can be omitted.
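As one possible way to pinpoint such entries, the sketch below reports which rows of each feature column cannot be converted to a number. The data frame and its column names v1, v2 and cls are invented for illustration.

```r
# Hypothetical data read in as text: v1 has a decimal comma and a stray word
mydades <- data.frame(
  v1  = c("1.5", "2,3", "foo", "4"),
  v2  = c("1", "2", "3", "4"),
  cls = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

feature_cols <- c("v1", "v2")

# For each feature column, the row indices whose value becomes NA under as.numeric()
bad_entries <- lapply(mydades[, feature_cols], function(col) {
  converted <- suppressWarnings(as.numeric(col))
  which(is.na(converted) & !is.na(col))
})
print(bad_entries)

# Show the offending raw values of v1
print(mydades$v1[bad_entries$v1])
```

Here rows 2 and 3 of v1 are flagged, which matches the "wrong decimal symbol" and "raw text" guesses above.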
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
library(class)  # knn() is provided by the 'class' package
knn(train=mydades.training[,-7], test=mydades.test[,-7], cl=mydades.training[,7], k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coercible to numeric.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.
I would also point out that the function seemed to fail for me when using integers. I needed to convert everything into the "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, in my case, as.numeric(my_frame$target_feature) was required.
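The three preparation tasks listed above could be sketched roughly as follows. The data frame, its column names, and the choice of median imputation are all assumptions made for illustration, not a prescribed recipe.

```r
# Hypothetical data: one numeric feature with a missing value and one factor feature
df <- data.frame(
  height = c(1.70, NA, 1.82, 1.65),
  colour = factor(c("red", "blue", "red", "green"))
)

# (1) Fix missing values; here the column median is used as a simple placeholder
df$height[is.na(df$height)] <- median(df$height, na.rm = TRUE)

# (2) Turn the factor into one 0/1 indicator column per level
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Rescale a numeric vector to the range 0..1
rescale01 <- function(v) {
  rng <- range(v)
  if (diff(rng) == 0) return(rep(0, length(v)))  # constant column: avoid division by zero
  (v - rng[1]) / diff(rng)
}

# Numeric feature matrix suitable as the train/test argument of knn()
features <- cbind(height = rescale01(df$height), dummies)
print(features)
```

The resulting matrix contains only numbers between 0 and 1, so no column dominates the Euclidean distance simply because of its units.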
