I'm using table() to show the results from the k-means clustering vs. the actual class values. How can I calculate the % accuracy based on that table? I know how to do it manually.
Iris-setosa had all 50 in cluster 2, while Iris-versicolor had two in the other cluster.
Is there a way to calculate the percentage, like "Incorrectly classified instances: 52%"?
I would like to print the confusion matrix by classes and clusters. Something like this:
0 1 <-- assigned to cluster
380 120 | 1
135 133 | 0
Cluster 0 <-- 1
Cluster 1 <-- 0
Incorrectly clustered instances : 255.0 33.2031 %
You can use diag() to select the cases on the diagonal and use that to calculate (in)accuracy as shown below:
sum(diag(d))/sum(d) #overall accuracy
1-sum(diag(d))/sum(d) #incorrect classification
You can also use this to calculate the number of cases (in)correctly classified:
sum(diag(d)) #N cases correctly classified
sum(d)-sum(diag(d)) #N cases incorrectly classified
where d is your confusion matrix
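For example, with the iris data from the question, the matrix d could be built roughly like this (a sketch; note that the diagonal approach assumes each cluster lines up with the class in the same position, so you may need to reorder rows or columns first):
km <- kmeans(iris[, 1:4], 3)           # example k-means fit on the iris measurements
d  <- table(km$cluster, iris$Species)  # clusters in rows, classes in columns
sum(diag(d)) / sum(d)                  # overall accuracy, once rows and columns are aligned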
I have a question relating to the “randomForest” package in R. I am trying to build a model with ecological variables that best explain my species occupancy data for 41 sites in the field (which I have gathered from camera traps). My ultimate goal is to do species occupancy modeling using the “unmarked” package, but before I get to that stage I need to select the variables that best explain my occupancy, since I have many. To gain some understanding of the randomForest package, I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of my occupancy, and B and C being bad predictors). When I run randomForest, my output looks like this:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533 26.9634018 20.6505920
B 0.9567857 0.00000 0.9665287 0.0728273
C 0.4261638 0.00000 0.4242409 0.1411643
D 32.1889374 35.52439 34.0485837 27.0691574
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
I did not make separate training and test sets; I put extra weight on the model to correctly predict the “1”s, and the variables are scaled.
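For reference, the call that produces this kind of output looks roughly like the following (a simplified sketch; dat, occ, the seed and the class weights are placeholders, not my exact setup):
library(randomForest)
set.seed(42)                                               # placeholder seed
rf <- randomForest(as.factor(occ) ~ A + B + C + D, data = dat,
                   importance = TRUE, classwt = c(1, 10))  # extra weight on the "1" class
importance(rf)   # MeanDecreaseAccuracy / MeanDecreaseGini table
rf$confusion     # OOB confusion matrix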
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911 29.00879 23.58469
D 29.75068 30.79498 29.97520 24.53415
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
When I run the model with only one good predictor (A or D), or with a good and a bad predictor (AB or CD), the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change.
Why do these values change, and how should I approach the selection of my variables? (I am a beginner in occupancy modeling.)
Thanks a lot!
I have a large dataset of individuals who have rated some items (x1:x10). For each individual, the ratings have been combined into a total score (ranging 0-5). Now, I would like to draw two subsamples with the same sample size, in which the total score has a specific mean (1.5 and 3) and follows a normal distribution. Individuals might be part of both subsamples.
My guess is that sampling from a vector (the total score) with the outlined specifications would solve this. Unfortunately, I have only found ways to draw random samples from a vector, but not a way to sample around a specific mean.
EDIT:
As pointed out, a normal distribution would not be possible. Instead, I am looking for a way to sample a binomial distribution (directly from the data, without the workaround of creating a similar distribution and matching).
You can't have normally distributed data on a discrete scale with hard limits. A sample drawn from a normal distribution with a mean between 0 and 5 would be symmetrical around the mean, would take continuous rather than discrete values, and would have a non-zero probability of containing values of less than zero and more than 5.
You want your sample to contain discrete values between zero and five and to have a central tendency around the mean. To emulate scores with a particular mean you need to sample from the binomial distribution using rbinom.
# Draw n scores in 0..5 whose expected value is m (binomial with size 5 and prob m/5)
get_n_samples_averaging_m <- function(n, m) {
  rbinom(n, 5, m / 5)
}
Now you can do
samp <- get_n_samples_averaging_m(40, 1.5)
print(samp)
# [1] 1 3 2 1 3 2 2 1 1 1 1 2 0 3 0 0 2 2 2 3 1 1 1 1 1 2 1 2 0 1 4 2 0 2 1 3 2 0 2 1
mean(samp)
# [1] 1.5
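If you then need to pick actual individuals from your data rather than synthetic scores, one rough sketch is to draw target scores first and then pick one matching individual per target (this assumes a data frame df with a total_score column, and it will fail if some drawn score has no matching individual):
target <- get_n_samples_averaging_m(40, 1.5)   # target scores
idx <- sapply(target, function(s) {
  candidates <- which(df$total_score == s)     # individuals with that exact score
  candidates[sample(length(candidates), 1)]    # pick one of them at random
})
subsample <- df[idx, ]
mean(subsample$total_score)                    # should be close to 1.5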
I compared the silhouette widths of different cluster algorithms on the same dataset: k-means, clara and pam. I can see which one scores the highest on silhouette width. But can I now statistically test whether the solutions differ from each other, similarly to what we normally do with ANOVA?
I formulated the hypothesis for my thesis that clara and pam would give more valid results than k-means. I know the silhouette width of both of them is higher, but I don't know how I can statistically confirm/disconfirm my hypothesis.
#######4: Behavioral Clustering
library(cluster)   # for pam() and clara() below
##4.1 Kmeans
kmeans.res.4.1 <- kmeans(ClusterDFSBeha, 2)
print(kmeans.res.4.1)
#Calculate SW
library(clValid)
intern4.1 <- clValid(ClusterDFSBeha, 2, clMethods="kmeans",validation="internal", maxitems = 9800)
summary(intern4.1)
#Silhouette width = 0.7861
##4.2 PAM
pam.res.4.2 <- pam(ClusterDFSBeha, 2)
print(pam.res.4.2)
intern4.2 <- clValid(ClusterDFSBeha, 2, clMethods="pam", validation="internal", maxitems = 9800)
summary(intern4.2)
#Silhouette width = 0.6702
##4.3 Clara
clara.res.4.3 <- clara(ClusterDFSBeha,2)
print(clara.res.4.3)
intern4.3 <- clValid(ClusterDFSBeha, 2, clMethods="clara", validation="internal", maxitems = 9800)
summary(intern4.3)
#Silhouette width = 0.8756
Now I would like to statistically assess whether the methods 'differ' from each other, so that I can reject or accept my hypothesis at a certain p level.
This is not a perfect answer.
If you want to test the "quality" of a clustering method, the best thing to do is to look at the partition given by the algorithm.
To check this, you can compare partitions through a measure like the ARI (Adjusted Rand Index); we call that relative performance. Another idea is to use simulated data where you know the true labels, so you can compare your results and see how far you are from the truth. The last approach I know of is to assess the stability of your clustering method under small perturbations of the data: the gap statistic of Rob Tibshirani.
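For example, the ARI between two partitions can be computed directly; a minimal sketch using adjustedRandIndex() from the mclust package and the objects from your code:
library(mclust)
adjustedRandIndex(kmeans.res.4.1$cluster, pam.res.4.2$clustering)   # 1 = identical partitions, ~0 = chance-level agreement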
But in fact, in clustering theory (unsupervised classification) it is really hard to evaluate the pertinence of a clustering. We have fewer model selection criteria than for supervised learning tasks.
I really advise you to look around online; for instance, this package vignette seems to be a good introduction:
https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
To answer directly, I don't think that what you are looking for exists. If it does, I would be really happy to learn more about it.
Such a comparison will never be fair.
Any such test makes some assumptions, and a clustering method that is based on similar assumptions can be expected to score better.
For example, if you use Silhouette with Euclidean distance, PAM with Euclidean distance, and k-means, it must be expected that PAM has an advantage. If you used Silhouette with squared Euclidean distances instead, k-means is almost certainly going to fare best (and it is also almost certain to outperform PAM with squared Euclidean).
So you aren't judging which method is "better", but which correlates more with your evaluation method.
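To illustrate, the same k-means partition can score differently depending on which dissimilarity you hand to the silhouette; a sketch using the cluster package and the ClusterDFSBeha data from the question:
library(cluster)
km <- kmeans(ClusterDFSBeha, 2)
d_euc <- dist(ClusterDFSBeha)                                              # Euclidean distances
mean(silhouette(km$cluster, dist = d_euc)[, "sil_width"])                  # silhouette on Euclidean
mean(silhouette(km$cluster, dmatrix = as.matrix(d_euc)^2)[, "sil_width"])  # silhouette on squared Euclidean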
There is a simple way using contingency tables. Say you get one set of cluster assignments ll and another cc; in the ideal situation those labels align perfectly. From that table you can produce a statistic using the chi-squared test, along with the p-value for the significance of the allocation differences:
ll1 = rep(c(4,3,2,1),100)
cc1 = rep(c(1:4),length(ll1)/4)
table(cc1, ll1)
print(paste("chi statistic=",chisq.test(cc1, ll1)$statistic ))
print(paste("chi pvalue=",chisq.test(cc1, ll1)$p.value ))
producing:
ll1
cc1 1 2 3 4
1 0 0 0 100
2 0 0 100 0
3 0 100 0 0
4 100 0 0 0
[1] "chi statistic= 1200"
[1] "chi pvalue= 1.21264177763119e-252"
meaning that the cell counts are not randomly (uniformly) allocated, supporting an association. For a random allocation:
ll2 = sample(c(4,3,2,1),100,replace=TRUE)
cc2 = sample(c(1:4),length(ll2),replace=TRUE)
table(cc2, ll2)
print(paste("chi statistic=",chisq.test(cc2, ll2)$statistic ))
print(paste("chi pvalue=",chisq.test(cc2, ll2)$p.value ))
with outputs:
ll2
cc2 1 2 3 4
1 6 7 6 10
2 5 5 7 9
3 6 7 7 4
4 4 8 5 4
[1] "chi statistic= 4.96291083483202"
[1] "chi pvalue= 0.837529350518186"
supporting that there is no association.
You can use this for your cluster assignments from different algorithms, to see if they are randomly associated or not.
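For example, with the objects from your code above it would look like this (reusing kmeans.res.4.1 and pam.res.4.2):
tab <- table(kmeans.res.4.1$cluster, pam.res.4.2$clustering)   # k-means vs PAM assignments
chisq.test(tab)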
You can also use the Variation of Information distance for clusterings to get the distance between the assignments. For ll1 and cc1 (vi.dist() from the 'mcclust' R package):
vi.dist(ll1, cc1)
vi.dist(ll1, cc1, parts=TRUE)
you get
0
vi = 0, H(1|2) = 0, H(2|1) = 0
and for the sampled ll2 and cc2
vi.dist(ll2, cc2)
vi.dist(ll2, cc2, parts=TRUE)
3.68438190593985
vi = 3.68438190593985, H(1|2) = 1.84631473075115, H(2|1) = 1.83806717518869
There's also the V-measure you can apply.
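There is no base R function for it, but it can be written from the conditional entropies in a few lines; a from-scratch sketch (with beta = 1), not taken from any particular package:
v_measure <- function(truth, pred, beta = 1) {
  tab <- table(truth, pred)
  n <- sum(tab)
  H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  H_C <- H(rowSums(tab) / n)   # entropy of the first labelling
  H_K <- H(colSums(tab) / n)   # entropy of the second labelling
  # conditional entropies H(C|K) and H(K|C)
  H_CK <- -sum((tab / n)[tab > 0] * log(sweep(tab, 2, colSums(tab), "/")[tab > 0]))
  H_KC <- -sum((tab / n)[tab > 0] * log(sweep(tab, 1, rowSums(tab), "/")[tab > 0]))
  hom  <- if (H_C == 0) 1 else 1 - H_CK / H_C   # homogeneity
  comp <- if (H_K == 0) 1 else 1 - H_KC / H_K   # completeness
  if (hom + comp == 0) return(0)
  (1 + beta) * hom * comp / (beta * hom + comp)
}
v_measure(ll1, cc1)   # 1 for the perfectly aligned labels above
v_measure(ll2, cc2)   # low for the random labels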
I am using the survey package to analyze data from a two-phase survey. The first phase included ~5000 cases, and the second ~2700. Weights were calculated beforehand to adjust for several variables (ethnicity, sex, etc.) and, as far as I understand, for the decrease in sample size when performing phase 2.
I'm interested in proportions of binary variables, e.g. suicides in sick vs. healthy individuals.
An example of a simple output I receive in the overall sample:
table (df$schiz,df$suicide)
0 1
0 4857 8
1 24 0
An example of a simple output I receive in the second phase sample only:
table (df2$schiz,df2$suicide)
0 1
0 2685 5
1 24 0
And within the phase two sample with the weights included:
dfw <- svydesign(ids=~1, data=df2, weights=df2$weights)
svytable(~schiz+suicide, design=dfw)
suicide
schiz 0 1
0 2701.51 2.67
1 18.93 0.00
My question is: shouldn't the weights correct for the decrease in N when moving from phase 1 to phase 2? That is, shouldn't the total N of the second table after correction be ~5000 cases, and not ~2700 as it is now?
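For reference, a quick way to check what the weights actually sum to (a sketch using the objects above):
sum(df2$weights)                               # implied total N of the weighted phase-2 sample
sum(svytable(~schiz+suicide, design=dfw))      # total of the weighted table (should match, barring missing values)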
I've used mclust to find clusters in a dataset. Now I want to implement these findings in external non-R software (predict.Mclust is thus not an option, as has been suggested in previous similar questions) to classify new observations. I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it felt reasonable to calculate the Mahalanobis distance for every observation and every cluster. Observations could then be classified to the Mahalanobis-nearest cluster. It does not seem to work fully, however.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification as mclust does by the Mahalanobis approach outlined above):
library(MASS)     # for mvrnorm()
library(mclust)   # for Mclust()
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List mahalanobis distance for miss-classified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate miss-classified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
mahal_classification
int_class 1 2
1 124 0
2 5 171
> mahal_clust_dist[mahal_classification!=int_class,]
mahal_clust1 mahal_clust2
[1,] 1.340450 1.978224
[2,] 1.607045 1.717490
[3,] 3.545037 3.938316
[4,] 4.647557 5.081306
[5,] 1.570491 2.193004
Five observations are classified differently between the Mahalanobis approach and mclust. In the plots they are intermediate points between the two clusters. Could someone tell me why it does not work, and how I could mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thx LoBu) and found that the key was to calculate the posterior probability (pp) that an observation belongs to a certain cluster, and to classify according to the maximal pp. The following works:
library(mvtnorm)   # for dmvnorm(); mclust also provides an equivalent dmvnorm()
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class 1 2
1 124 0
2 0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and pp approaches, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to the Mahalanobis distance, you also need to take the cluster weights into account.
These weight the relative importance of the clusters when they overlap.
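For concreteness, a minimal sketch using the objects from the question: the per-cluster score is log(weight) - 0.5*log|Sigma| - 0.5*(squared Mahalanobis distance), and the observation goes to the cluster with the larger score.
# mahal_clust1/2 from the question are already squared Mahalanobis distances
score1 <- log(m$parameters$pro[1]) - 0.5 * log(det(clust1_cov)) - 0.5 * mahal_clust1
score2 <- log(m$parameters$pro[2]) - 0.5 * log(det(clust2_cov)) - 0.5 * mahal_clust2
weighted_class <- ifelse(score1 > score2, 1, 2)
table(weighted_class, m$classification)   # should now agree with mclust's classification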