Clustering with Mclust results in an empty cluster - r

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with nearly identical mean values to the second cluster. This only appears when specifically looking for an univariate, equal variance model. Using for example modelNames="V" or leaving it default, does not produce this problem.
This thread: Cluster contains no observations has a similary problem, but if I understand correctly, this appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!

As you noted the mean of cluster 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see spike on histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust or gmm are probabilistic models, which estimates the mean / variance of clusters and also the probabilities of each point belonging to each cluster. This is unlike k-means provides a hard assignment. So the likelihood of the model is the sum of the probabilities of each data point belonging to each cluster, you can check it out also in mclust's publication
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means if you have a data point that is around the means of 1 or 2, it will be consistently assigned to cluster 2, for example let's try to predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question, no you did not do anything wrong, it's a fallback at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically take only the clusters that have a membership.
If you use model="V", i see the solution is equally problematic:
fitv <- Mclust(Data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit learn GMM I don't see a similar issue.. So if you need to use a gaussian mixture with spherical means, consider using a fuzzy kmeans:
library(ClusterR)
plot(NULL,xlim=range(data),ylim=c(0,4),ylab="cluster",yaxt="n",xlab="values")
points(data$value,fit_kmeans$clusters,pch=19,cex=0.1,col=factor(fit_kmeans$clusteraxis(2,1:3,as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

Related

kernel PCA with Kernlab and classification of Colon--cancer dataset

I need to Perform kernel PCA on the colon-­‐cancer dataset:
and then
I need to Plot number of principal components vs classification accuracy with PCA data.
For the first part i am using kernlab in R as follows (let number of features be 2 and then i will vary it from say 2-100)
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having tough time to understand how to use this PCA data for classification ( i can use any classifier for eg SVM)
EDIT : My Question is how to feed the output of PCA into a classifier
data looks like this (cleaned data)
uncleaned original data looks like this
I will show you a small example on how to use the kpca function of the kernlab package here:
I checked the colon-cancer file but it needs a bit of cleaning to be able to use it so I will use a random data set to show you how:
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the pca you need to do:
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=4)
which is the same way as you use it. However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigen values. In your case I would probably pick 100 principal components and chose the first n number of principal components according to the highest eigen values. Let's see this in my random example after running the previous code:
In order to see the eigen values:
> kpc#eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigen values because my data is random. In your case I assume you will get better ones. You need to choose the n number of components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the svm below).
In order to access the principal components i.e. the PCA output you do this:
> kpc#pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix of 4 columns i.e. the number of the features argument which is the PCA output i.e. the principal components. kerlab uses the S4 Method Dispatch System and that is why you use # at kpc#pcv.
You then need to use the above matrix to feed in an svm in the following way:
svmmatrix <- kpc#pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
And that's it! A very good explanation I found on the internet about pca can be found here in case you or anyone else reading this wants to find out more.

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulties understanding how the multiclass.roc parameters should look like.
Here a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem I know that 'predictor' should contain probabilities (one per observation). However, in my case, I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) correspond to the probability for each class.
Does anyone know how should my 'predictor' should look like rather than what it's currently look like ?
The pROC package is not really designed to handle this case where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.

Consistent K-Means Clustering Results

I have read k-means: Same clusters for every execution.
But it doesn't solve the problem I am having. I am sampling data that varies in sizes (increases in sizes). I need to cluster the data using k-means but the problem I am having is that each sample the clusters differ. The important thing to note is that my t+1 sample will always incorporate all of the components from the tth sample. So it slowly gets bigger and bigger. What I need is a way to be able to have the clusters stay the same. Is there a way around this other than using set.seeds? I am open to any solution.
The best way I can think to accomplish this would be to initially cluster the data with k-means and then to simply assign all additional data to closest cluster (setting the random seed will not help you to get the new clusters to nest within the original ones). As detailed in the answer to this question, the flexclust package makes this pretty easy:
# Split into "init" (used for initial clustering) and "later" (assigned later)
set.seed(100)
spl <- sample(nrow(iris), 0.5*nrow(iris))
init <- iris[spl,-5]
later <- iris[-spl,-5]
# Build the initial k-means clusters with "init"
library(flexclust)
(km <- kcca(init, k=3, kccaFamily("kmeans")))
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = init, k = 3, family = kccaFamily("kmeans"))
#
# cluster sizes:
#
# 1 2 3
# 31 25 19
# Assign each element of "later" to the closest cluster
head(predict(km, newdata=later))
# 2 5 7 9 14 18
# 2 2 2 2 2 2

R e1071: Balanced Error Rate (BER) as error criterion in tune function

I'm kind of new to R and machine learning in general, so apologies if this seems stupid!
I'm using the e1071 package to tune the parameters of various models. My dataset is very unbalanced and I would like for the error criterion to be Balanced Error Rate... NOT overall classification error. However, I'm stumped as how to achieve this.
Here is my code:
#Find optimal value 'k' value for k-NN model (feature subset).
c <- data_train_sub[1:13]
d <- data_train_sub[,14]
knn2 <- tune.knn(c, d, k = 1:10, tunecontrol = tune.control(sampling = "cross", performances = TRUE, sampling.aggregate = mean)
)
summary(knn2)
plot(knn2)
Which returns this:
Parameter tuning of ‘knn.wrapper’:
- sampling method: 10-fold cross validation
- best parameters:
k
1
- best performance: 0.001190476
- Detailed performance results:
k error dispersion
1 1 0.001190476 0.003764616
2 2 0.005952381 0.006274360
3 3 0.003557423 0.005728122
4 4 0.005924370 0.008352124
5 5 0.005938375 0.008407043
6 6 0.005938375 0.008407043
7 7 0.007128852 0.008315090
8 8 0.009495798 0.009343555
9 9 0.008305322 0.009751997
10 10 0.008319328 0.009795292
Has anyone any experience of altering the error being assessed in this function?
Look at the class.weights argument of the svm() function:
a named vector of weights for the different classes, used for asymmetric class sizes...
Coefficient can easily be calculated as such:
class.weights = table(Xcal$species)/sum(table(Xcal$species))

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop and algorithm and generate a samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge to you go back an ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning, dgeom and friends goes from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly, you have to do it the first way, but you have to estimate the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above but you have one fewer degree of freedom for estimating the parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows the much same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$val, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 100]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")

Resources