I have a data frame, df, containing the x and y coordinates of a bunch of points. Here's an excerpt:
> tail(df)
x y
1495 0.627174 0.120215
1496 0.616036 0.123623
1497 0.620269 0.122713
1498 0.630231 0.110670
1499 0.611844 0.111593
1500 0.412236 0.933250
I am trying to find out the most appropriate number of clusters. Ultimately the goal is to do this with tens of thousands of these data frames, so the method of choice must be quick and can't be visual. Based on those requirements, it seems like the RWeka package is the way to go.
I managed to load the RWeka package successfully (I had to install the Java SE Runtime on my computer first) as well as the XMeans Weka package, and run it:
library("RWeka") # requires Java SE Runtime
WPM("refresh-cache") # Build Weka package metadata cache
WPM("install-package", "XMeans") # Install XMeans package if not previously installed
weka_ctrl <- Weka_control( # Create a Weka control object to specify our parameters
I = 100, # max no iterations overall
M = 100, # max no iterations in the kmeans loop
L = 2, # min no clusters
H = 5, # max no clusters
D = "weka.core.EuclideanDistance", # distance metric
C = 0.4, S = 1)
x_means <- XMeans(df, control = weka_ctrl) # run algorithm on data
This produces exactly the result I want:
XMeans
======
Requested iterations : 100
Iterations performed : 1
Splits prepared : 2
Splits performed : 0
Cutoff factor : 0.4
Percentage of splits accepted
by cutoff factor : 0 %
------
Cutoff factor : 0.4
------
Cluster centers : 2 centers
Cluster 0
0.4197712002617799 0.9346986806282739
Cluster 1
0.616697959239131 0.11564350951086963
Distortion: 30.580934
BIC-Value : 2670.359509
I can assign each point in my data frame to a cluster by running x_means$class_ids.
However, I would like a way of retrieving the coordinates of the cluster centres. I can see them in the output and could write them down manually, but if I am to run tens of thousands of these, I need a piece of code that saves them into a variable. I can't seem to subset x_means using square brackets, so I don't know what else to do.
Thank you so much in advance for your help!
The centers do not seem to be directly stored in the structure that is returned. However, since the structure does tell you which cluster each point belongs to, it is easy to compute the centers. Since you do not provide your data, I will illustrate with the built-in iris data.
As you observed, printing out the result shows the centers. We can use this to check the result.
x_means <- XMeans(iris[,1:4], control = weka_ctrl)
x_means
## Output truncated to just the interesting part.
Cluster centers : 2 centers
Cluster 0
6.261999999999998 2.872000000000001 4.906000000000001 1.6760000000000006
Cluster 1
5.005999999999999 3.428000000000001 1.4620000000000002 0.2459999999999999
So here's how to compute that:
colMeans(iris[x_means$class_ids==0,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.262 2.872 4.906 1.676
colMeans(iris[x_means$class_ids==1,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
The results agree.
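More generally, you can collect all the centers into one object instead of one colMeans() call per cluster. A minimal sketch using aggregate() on the same iris example; for your own data it would be aggregate(df, by = list(cluster = x_means$class_ids), FUN = mean).
# One row per cluster: aggregate() averages every column within each class id
# returned by XMeans, so the centers end up in a plain data frame you can save.
centers <- aggregate(iris[, 1:4], by = list(cluster = x_means$class_ids), FUN = mean)
centers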
I want to calculate silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
nstart=20,algorithm="MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say data with 50,000 obs; however, I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data size is a bit larger. This might also be a problem for the Dunn index and other internal indices on large datasets.
I have 32 GB of RAM in my computer. The problem comes from calculating dist(data). I am wondering if it is possible not to calculate dist(data) in advance, and instead compute the corresponding distances only when they are required in the silhouette formula.
I appreciate your help regarding this problem and how I can calculate silhouette for large and very large datasets.
You can implement Silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.
It will still take a LONG time though.
You should consider using only a subsample (do you really need to consider all points?) as well as alternatives such as the Simplified Silhouette, in particular with k-means... You gain very little from extra data with such methods, so you may as well just use a subsample.
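For reference, here is a minimal sketch of the Simplified Silhouette idea (not taken from any package, just an illustration assuming a k-means style result like km_res from the question): it only ever builds an n x k matrix of point-to-centroid distances, never the n x n distance matrix.
# a(i): distance from point i to its own centroid
# b(i): distance from point i to the nearest other centroid
# s(i) = (b - a) / max(a, b), as in the simplified silhouette
simplified_silhouette <- function(data, centers, cluster) {
  data <- as.matrix(data)
  # n x k matrix of distances from every point to every centroid
  d <- sapply(seq_len(nrow(centers)), function(k)
    sqrt(rowSums(sweep(data, 2, centers[k, ])^2)))
  a <- d[cbind(seq_len(nrow(d)), cluster)]    # own-centroid distance
  d[cbind(seq_len(nrow(d)), cluster)] <- Inf  # mask own cluster
  b <- apply(d, 1, min)                       # nearest other centroid
  (b - a) / pmax(a, b)
}
# mean simplified silhouette for the k-means result from the question:
# mean(simplified_silhouette(data, km_res$centers, km_res$cluster))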
Anony-Mousse's answer is perfect, particularly the subsampling suggestion. This is very important for very large datasets due to the increase in computational cost.
Here is another solution for calculating internal measures such as the silhouette and Dunn index, using the R package clusterCrit. clusterCrit calculates clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. Please see the link below for the clusterCrit documentation:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most of the internal measures for cluster validation.
Example:
intCriteria(as.matrix(data), km_res$cluster, c("Silhouette","Calinski_Harabasz","Dunn")) # intCriteria expects a numeric matrix
If you need to calculate the silhouette index without building the full distance matrix, you can alternatively use the clues package, which improves on both the time and the memory used by the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)
set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)
df <- cbind(x,y)
head(df)
x y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,] 1.40283748 -1.30334876
[4,] 0.06345755 -0.62755613
[5,] 0.11635896 2.33864121
[6,] 1.54355849 -0.03367351
Runtime comparison between the two functions
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
f2 = get_Silhouette(y = df, mem = cluster))
test replications elapsed relative user.self sys.self user.child sys.child
1 f1 100 15.16 1.902 13.00 1.64 NA NA
2 f2 100 7.97 1.000 7.76 0.00 NA NA
Comparison in memory usage between the two functions
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB
In conclusion, clues::get_Silhouette reduces both the time and the memory used while producing the same result.
This issue applies to my own data, but for the sake of reproducibility, my issue/question is also present in the FactoExtra vignette, or here, so I'll use that for the sake of simplicity.
To start, a simple PCA was generated (scale = T) and the variable coordinates for the first 4 axes extracted:
head(var$coord) # coordinates of variables
> Dim.1 Dim.2 Dim.3 Dim.4
> Sepal.Length 0.8901688 -0.36082989 0.27565767 0.03760602
> Sepal.Width -0.4601427 -0.88271627 -0.09361987 -0.01777631
> Petal.Length 0.9915552 -0.02341519 -0.05444699 -0.11534978
> Petal.Width 0.9649790 -0.06399985 -0.24298265 0.07535950
This was also done for the "individuals." Here is the output:
head(ind$coord) # coordinates of individuals
> Dim.1 Dim.2 Dim.3 Dim.4
> 1 -2.257141 -0.4784238 0.12727962 0.024087508
> 2 -2.074013 0.6718827 0.23382552 0.102662845
> 3 -2.356335 0.3407664 -0.04405390 0.028282305
> 4 -2.291707 0.5953999 -0.09098530 -0.065735340
> 5 -2.381863 -0.6446757 -0.01568565 -0.035802870
> 6 -2.068701 -1.4842053 -0.02687825 0.006586116
Since the PCA was generated with scale=T, I'm highly confused as to why the individual coordinates are not scaled (-1 to 1?). For instance, "individual 1" has a DIM-1 score of -2.257141, but I have no comparative basis for the variable coordinates which range from -0.46 to 0.991. How can a score of -2.25 be interpreted with a scaled PCA range of -1 to 1?
Am I missing something?
Thanks for your time!
Updated with all relevant code gaps filled:
> data(iris)
> res.pca <- prcomp(iris[, -5], scale = TRUE)
> ind <- get_pca_ind(res.pca)
> print(ind)
> var <- get_pca_var(res.pca)
> print(var)
I asked the author of FactoExtra this question. Here was his reply:
Scale = TRUE will normalize the variables to make them comparable. This is particularly recommended when variables are measured in different scales (e.g: kilograms, kilometers, centimeters, …);(http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/).
In this case, the correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations.
So, the coordinates of individuals are not expected to be between -1 and 1, even if scale = TRUE.
It’s only possible to interpret the relative position of individuals and variables by creating a biplot as described at: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/.
A biplot isn't ideal for me, but I have tried rescaling and it works. Also, I suppose I could take an individual and project them onto the PCA to see where they fall.
Anyways, that's the end of that. Thanks for your help @Hack-r!
The scaling that is done by prcomp(..., scale = T) is scaling of the input variables to unit variance.
I don't think it does anything about range standardization of the individual coordinates, unless perhaps center = ... is used. However, it would be easy to do post-hoc (or pre). Here's a related post:
Range standardization (0 to 1) in R
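As an illustration, a minimal post-hoc sketch, assuming ind$coord from get_pca_ind() as in the question (rescale01 is just an illustrative helper, not part of factoextra):
# Map each principal-component dimension of the individual coordinates onto [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
ind_scaled <- apply(ind$coord, 2, rescale01)
head(ind_scaled)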
I am running spBayes to fit an 'offset' model y ~ 1.
I have a data frame like this:
ID lon lat y
1 A 90.0 5.9 0.957096100
2 A 90.5 6.0 0.991374969
3 A 91.1 6.0 0.991374969
4 A 92.7 6.1 0.913501740
5 A 94.0 6.1 0.896575928
6 A 97.8 5.2 0.631320953
7 A 98.9 4.4 -0.282432556
8 A 101.2 2.8 1.829053879
9 A 102.3 2.0 0.993621826
10 A 105.8 0.5 0.038677216
where the variable ID is a factor with two levels, A and B. I would like to find an offset for the two IDs. However, when I run
fit.by.ALL <- spLM(formula=y ~ ID, data= df, coords=coords,
priors=priors, tuning=tuning, starting=starting,
cov.model = "exponential", n.samples = n.samples, verbose = TRUE,
n.report = 50)
I get the following result:
Iterations = 1:251
Thinning interval = 1
Number of chains = 1
Sample size per chain = 251
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 1.0736 2.8674 0.18099 0.18099
IDB -0.9188 0.1922 0.01213 0.01213
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) -4.952 -0.773 1.1059 3.0165 6.4824
IDB -1.303 -1.048 -0.9284 -0.7679 -0.5795
The result doesn't seem very stable, as it keeps changing every time I run it.
Moreover, to find the final offset for ID B I need to add the (Intercept) mean to the IDB mean; how does that work for the SD?
Would it be better to run the spLM formula separately for the two IDs (with y~1 instead of y~ID)?
Thanks
I am unclear what you mean by "fit an offset model y ~ 1". When I read this, I think you want a model that only has an intercept, but reading further it suggests you want a model where you can estimate the mean for both groups, which can be done using
y ~ 0 + ID # manually remove the intercept
To answer your questions:
The result doesn't seem very stable, as it keeps changing every time I run it.
You are not using very many iterations. Try running with more iterations. With enough iterations the results should be stable.
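For example, a sketch only, reusing the priors, tuning, starting, and coords objects from your own call; the value of n.samples is just illustrative:
# Same model as before, but with many more posterior samples so the chain can stabilise.
n.samples <- 10000
fit.by.ALL <- spLM(formula = y ~ ID, data = df, coords = coords,
                   priors = priors, tuning = tuning, starting = starting,
                   cov.model = "exponential", n.samples = n.samples,
                   verbose = TRUE, n.report = 500)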
Moreover, to find the final offset for ID B I need to add the (Intercept) mean to the IDB mean; how does that work for the SD?
Again, I'm not sure what you mean by offset, but if you mean you want the difference in means between group A and group B, this is exactly what you have in the line beginning with IDB. That is, -0.9188 is the estimated difference in means between group B and group A, i.e. group B's mean is estimated to be 0.9188 smaller than group A's mean, and the SD is the posterior standard deviation.
If you are interested in group B's mean, then you are correct that you must add the (Intercept) to the IDB, but you cannot simply add the SDs. You have two options here: 1) use an appropriate design matrix (e.g. the y ~ 0 + ID formula above) that directly obtains your desired parameter estimates, or 2) obtain the MCMC samples, calculate the sum of the (Intercept) and IDB parameters for each iteration, and then take means and standard deviations of these sums.
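Here is a minimal sketch of option 2, assuming beta_samples is a matrix (or coda mcmc object) of posterior draws of the regression coefficients recovered from the spLM fit, with columns "(Intercept)" and "IDB":
# Per-draw mean for group B: intercept plus the B-vs-A difference.
b_mean <- beta_samples[, "(Intercept)"] + beta_samples[, "IDB"]
mean(b_mean)                        # posterior mean for group B
sd(b_mean)                          # posterior SD (not the sum of the two SDs)
quantile(b_mean, c(0.025, 0.975))   # 95% credible interval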
Would it be better to run the spLM formula separately for the two IDs (with y~1 instead of y~ID)?
If you ran them separately, then you would be estimating the spatial parameters separately. If the spatial parameters are different in the two groups, running them separately makes a lot of sense. If they are the same (or similar), then it probably makes more sense to fit the two groups together so you can "borrow information" about the spatial parameters between the two groups.
This is homework.
I have two matrices, one for training and one for testing.
The data has two columns which shall be used for classification and a third column with the known class. Both matrices have the third column.
     [,1] [,2] [,3]
[1,]  6.4 0.32    2
[2,]  4.8 0.34    0
[3,]  4.9 0.25    2
[4,]  7.2 0.32    1
where the integers in the third column are the classes (0-2).
The dimensions of my data sets are 100 x 3 for the training set and 38 x 3 for the testing set.
I have tried to use the knn() function from the class library.
knn uses the following arguments: (train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
I have tried to use my data sets directly, but then I get the error: "'train' and 'class' have different lengths"
I have tried a few thing, but I am stuck after some hours now. At the moment I have this code in my editor:
cl <- t(factor(c(rep("0",1), rep("1",1), rep("2",1))))
k <- knn(train, test, cl)
but it does not work. Can someone help me?
I want to run the function with 3 different k-values and find the accuracy of each. After that I will 5-fold cross-validate the best k.
As the documentation states, cl is the factor of true classifications of the training set, i.e. your y variable (the third column of your training set).
This means that the function should be as follows:
cl <- factor(c(2,0,2,1)) #or alternatively factor(train[,3])
k <- knn(train[,c(1,2)], test[,c(1,2)], cl)
As you can see, the y variable (the column with the classes) is not included in either the training or the test set passed to knn(); it is only supplied as a factor via the cl argument.
The error you received is because the number of rows of the training set was not equal to the length of the factor, which in your case had only 3 elements (because you thought you only needed to specify the levels of the factor there).
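To address the follow-up about trying 3 different k-values and computing the accuracy of each, a minimal sketch, assuming train and test are the 100 x 3 and 38 x 3 matrices described in the question (third column = class):
library(class)

cl_train <- factor(train[, 3])       # true classes of the training set

for (k in c(1, 3, 5)) {              # three candidate k values (illustrative)
  pred <- knn(train[, 1:2], test[, 1:2], cl_train, k = k)
  acc  <- mean(as.character(pred) == as.character(test[, 3]))  # proportion correct
  cat("k =", k, "accuracy =", round(acc, 3), "\n")
}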
I have read k-means: Same clusters for every execution.
But it doesn't solve the problem I am having. I am sampling data that varies in size (it keeps increasing). I need to cluster the data using k-means, but the problem is that the clusters differ from sample to sample. The important thing to note is that my t+1 sample will always incorporate all of the components of the tth sample, so it slowly gets bigger and bigger. What I need is a way to make the clusters stay the same. Is there a way around this other than using set.seed? I am open to any solution.
The best way I can think of to accomplish this would be to initially cluster the data with k-means and then simply assign all additional data to the closest cluster (setting the random seed will not help you get the new clusters to nest within the original ones). As detailed in the answer to this question, the flexclust package makes this pretty easy:
# Split into "init" (used for initial clustering) and "later" (assigned later)
set.seed(100)
spl <- sample(nrow(iris), 0.5*nrow(iris))
init <- iris[spl,-5]
later <- iris[-spl,-5]
# Build the initial k-means clusters with "init"
library(flexclust)
(km <- kcca(init, k=3, kccaFamily("kmeans")))
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = init, k = 3, family = kccaFamily("kmeans"))
#
# cluster sizes:
#
# 1 2 3
# 31 25 19
# Assign each element of "later" to the closest cluster
head(predict(km, newdata=later))
# 2 5 7 9 14 18
# 2 2 2 2 2 2
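As a follow-up, as the sample grows you can keep reusing the same fitted kcca object, so earlier points keep their original assignments and only the new rows are predicted (a sketch continuing the example above):
# Combine the original assignments with those of the newly arrived points.
all_clusters <- c(clusters(km), predict(km, newdata = later))
table(all_clusters)  # cluster sizes over the full, grown sample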