K-means clustering with pre-defined centroids in R

I'm trying to run K-means algorithm with predefined centroids. I have had a look at the following posts:
1. R k-means algorithm custom centers
2. Set static centers for kmeans in R
However, every time I run the command:
km = kmeans(df_std[,c(10:13)], centers = centroids)
I get the following error:
**Error: empty cluster: try a better set of initial centers**
I have defined the centroids as:
centroids = matrix(c(140.12774, 258.62615, 239.36800, 77.43235,
33.37736, 58.73077, 68.80000, 12.11765,
0.8937264, 0.8118462, 0.8380000, 0.8052941,
11.989858, 12.000000, 8.970000, 1.588235),
ncol = 4, byrow = T)
And my data is a subset of a data frame, say df_std; it has already been scaled:
df_std[,c(10:13)]
I'm wondering why the system would give the above error?
Any help on this would be highly appreciated!

Use a nearest-neighbour classifier with the centers only; do not recluster.
That means every point is labeled simply as the nearest center. This is similar to k-means, but you do not change the centers, you do not need to iterate, and every new data point can be processed independently and in any order. No problem arises even when processing just a single point at a time (in your case, k-means failed because one cluster became empty!)
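In R, this fixed-center assignment takes only a couple of lines; a minimal sketch, reusing the question's df_std and centroids and assuming both are on the same scale:
# Label each point with the index of its nearest center: no iteration,
# no center updates, so an "empty cluster" is harmless here.
labels <- apply(as.matrix(df_std[, 10:13]), 1, function(p)
  which.min(colSums((t(centroids) - p)^2)))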

While browsing for the specific error that I posted above:
Error: empty cluster: try a better set of initial centers
I found the following link to a conversation:
http://r.789695.n4.nabble.com/Empty-clusters-in-k-means-possible-solution-td4667114.html
Broadly speaking, the above error is generated when the centroids do not match the data.
It can happen when centers is a number k: due to the random starts of the k-means algorithm, there is a possibility that the randomly chosen centres do not match any data.
It can also happen when centers is a matrix of centroids (my case). The problem was that my data had been scaled but my centroids were unscaled.
The link above made me realise that there was a bug in my code. Hope it helps someone in a similar situation!
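As a concrete illustration of that fix: if df_std was produced by scale() (an assumption here), the centering and scaling values it stores can be applied to the raw centroids before clustering:
# Hypothetical: scale() stores the column means and standard deviations
# as attributes, which can be reused to put the centroids on the same scale.
ctr <- attr(df_std, "scaled:center")[10:13]
sc  <- attr(df_std, "scaled:scale")[10:13]
centroids_std <- sweep(sweep(centroids, 2, ctr, "-"), 2, sc, "/")
km <- kmeans(df_std[, 10:13], centers = centroids_std)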


What are the rules for ppp objects? Is selecting two variables possible for an sapply function?

I am working with code that describes a Poisson cluster process in spatstat, breaking down each line of code one at a time to understand it. It is easy to begin:
library(spatstat)
lambda<-100
win<-owin(c(0,1),c(0,1))
n.seeds<-lambda*win$xrange[2]*win$yrange[2]
Once the window is defined I then generate my points using a random generation function
x=runif(min=win$xrange[1],max=win$xrange[2],n=pmax(1,n.seeds))
y=runif(min=win$yrange[1],max=win$yrange[2],n=pmax(1,n.seeds))
This can be plotted straight away I know using the ppp function
seeds<-ppp(x=x,
y=y,
window=win)
plot(seeds)
In the next line I add marks to the ppp object; the marks apparently describe the angle of rotation of the points. I don't understand how this works right now, but that is okay; I will figure it out later.
marks<-data.frame(angles=runif(n=pmax(1,n.seeds),min=0,max=2*pi))
seeds1<-ppp(x=x,
y=y,
window=win,
marks=marks)
The first problem I encounter is that an object called pops, describing the populations of the window, is added to the ppp object. I understand how the values are derived: it is a Poisson distribution with a given input value mu, which can be any value, and the total number of observations equal to the number of points in the window.
seeds2<-ppp(x=x,
y=y,
window=win,
marks=marks,
pops=rpois(lambda=5,n=pmax(1,n.seeds)))
My first question is, how is it possible to add a variable that has no classification in the ppp object? I checked the ppp documentation and there is no mention of pops.
The second question I have is about chaining variables with $; the next line uses an sapply function to define dimensions.
dim1<-pmax(1,sapply(seeds1$marks$pops, FUN=function(x)rpois(n=1,sqrt(x))))
I have never seen the $ operator being used twice, and seeds2$marks$pop returns $ operator is invalid for atomic vectors. Could you explain what is going on here?
Many thanks.
That's several questions - please ask one question at a time.
From your post it is not clear whether you are trying to understand someone else's code, or developing code yourself. This makes a difference to the answer.
Just to clarify, this code does not come from inside the spatstat package; it is someone's code using the spatstat package to generate data. There is code in the spatstat package to generate simulated realisations of a Poisson cluster process (which is I think what you want to do), and you could look at the spatstat code for rPoissonCluster to see how it can be done correctly and efficiently.
The code you have shown here has numerous errors. But I will start by answering the two questions in your title.
The rules for creating ppp objects are set out in the help file for ppp. The help says that if the argument window is given, then unmatched arguments ... are ignored. This means that in the line
seeds2 <- ppp(x=x, y=y, window=win, marks=marks, pops=rpois(lambda=5, n=pmax(1,n.seeds)))
the argument pops will be ignored.
The idiom sapply(seeds1$marks$pops, FUN=f) is perfectly valid syntax in R. If the object seeds1 is a structure or list which has a component named marks, which in turn is a structure or list which has a component named pops, then the idiom seeds1$marks$pops would extract it. This has nothing particularly to do with sapply.
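For instance, a quick illustration at the R prompt (s is a made-up stand-in for such a nested structure):
# $ chains through nested lists; a data frame is itself a list of columns
s <- list(marks = data.frame(pops = c(2, 5, 3)))
s$marks$pops          # 2 5 3
v <- c(a = 1, b = 2)  # a plain (atomic) named vector
# v$a                 # Error: $ operator is invalid for atomic vectors
v["a"]                # use [ or [[ for atomic vectors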
Now turning to errors in the code,
The line n.seeds<-lambda*win$xrange[2]*win$yrange[2] is presumably meant to calculate the expected number of cluster parents (cluster seeds) in the window. This would only work if the window is a rectangle with bottom left corner at the origin (0,0). It would be safer to write n.seeds <- lambda * area(win).
However, the variable n.seeds is used later as if it were the number of cluster parents (cluster seeds). The author has forgotten that the number of seeds is random, with a Poisson distribution. So the more correct calculation would be n.seeds <- rpois(1, lambda * area(win))
However, this is still not correct, because cluster parents (seed points) outside the window can also generate offspring points inside the window. So seed points must actually be generated in a larger window obtained by expanding win. The appropriate commands used inside spatstat to generate the cluster parents are bigwin <- grow.rectangle(Frame(win), cluster_diameter); Parents <- rpoispp(lambda, win = bigwin)
The author apparently wants to assign two mark values to each parent point: a random angle and a random number pops. The correct way to do this is to make the marks a data frame with two columns, for example marks(seeds1) <- data.frame(angles=runif(n.seeds, max=2*pi), pops=rpois(n.seeds, 5))
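Putting these corrections together (and ignoring, for brevity, the window-expansion step for the parents), a minimal sketch might look like:
library(spatstat)
win <- owin(c(0, 1), c(0, 1))
lambda <- 100
# rpoispp draws a Poisson number of parent points in the window
seeds <- rpoispp(lambda, win = win)
n.seeds <- seeds$n
# two mark columns per parent: a rotation angle and a population count
marks(seeds) <- data.frame(angles = runif(n.seeds, max = 2 * pi),
                           pops   = rpois(n.seeds, 5))
plot(seeds)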

Estimation to plot person-item map not feasible because items "have no 0-responses" in data matrix

I am trying to create a person-item map that organizes the questions from a dataset in order of difficulty. I am using the eRm package, and the output should look like the following:
[person-item map](https://hansjoerg.me/post/2018-04-23-rasch-in-r-tutorial_files/figure-html/unnamed-chunk-3-1.png)
In one of the steps before running the function that outputs the map, I have to fit the dataset to obtain the model object that the plotting function uses to create the actual map, but I am getting an error when creating that object.
I have already tried to follow and review some documentation that might be useful if you want some extra information:
[Tutorial](https://hansjoerg.me/2018/04/23/rasch-in-r-tutorial/#plots)
[Plotting function](https://rdrr.io/rforge/eRm/man/plotPImap.html)
[Documentation](https://eeecon.uibk.ac.at/psychoco/2010/slides/Hatzinger.pdf)
Now, this is the code that I am using. First, I install and load the respective libraries and the data:
> library(eRm)
> library(ltm)
Loading required package: MASS
Loading required package: msm
Loading required package: polycor
> library(difR)
Then I fit the PCM to generate the object of class Rm, and here is the error:
(The PCM function here is specific to polytomous data; if I use a different one, the output says that I am not using a dichotomous dataset.)
> res <- PCM(my.data)
Warning:
The following items have no 0-responses:
AUT_10_04 AUN_07_01 AUN_07_02 AUN_09_01 AUN_10_01 AUT_11_01 AUT_17_01
AUT_20_03 CRE_05_02 CRE_07_04 CRE_10_01 CRE_16_02 EFEC_03_07 EFEC_05
EFEC_09_02 EFEC_16_03 EVA_02_01 EVA_07_01 EVA_12_02 EVA_15_06 FLX_04_01
... [rest of items]
Responses are shifted such that lowest category is 0.
Warning:
The following items do not have responses on each category:
EFEC_03_07 LC_07_03 LC_11_05
Estimation may not be feasible. Please check data matrix
I must clarify that the whole dataset has a range from 1 to 5; it is a Likert polytomous dataset.
Finally, I try to use the plot function and it does not produce any output; the system just keeps loading ad infinitum with no answer:
>plotPImap(res, sorted=TRUE)
I would like to add the description of that particular function and the arguments:
PCM(X, W, se = TRUE, sum0 = TRUE, etaStart)
X: Input data matrix or data frame with item responses (starting from 0); rows represent individuals, columns represent items. Missing values are inserted as NA.
W: Design matrix for the PCM. If omitted, the function will compute W automatically.
se: If TRUE, the standard errors are computed.
sum0: If TRUE, the parameters are normed to sum-0 by specifying an appropriate W. If FALSE, the first parameter is restricted to 0.
etaStart: A vector of starting values for the eta parameters can be specified. If missing, the 0-vector is used.
I do not understand why it is necessary to have scores beginning from 0; I think that is what the error is trying to say, but I don't quite understand the output.
I highly appreciate any hint you can provide.
Feel free to ask for any information that could be useful for reaching a solution to this issue.
The problem is not caused by the fact that there are items with no 0-responses. The model automatically corrects for this by shifting the response categories so that the lowest is zero. (You'll notice that the PI map you linked to is centered on zero. Also, I believe the map you linked to is of dichotomous data; polytomous data should include the scale categories on the PI map, I believe.)
Without being able to see your data, it is impossible to know the exact cause though.
It may be that the model is not converging; that may be what this error was alluding to: Estimation may not be feasible. Please check data matrix. You could check by entering res at the prompt. If the model was able to converge, you should see something like:
Conditional log-likelihood: -2.23709
Number of iterations: 27
Number of parameters: 8
...
Does your data contain answers with decimal numbers? I ran into the same error and solved it by using the dplyr::dense_rank() function:
df_ranked <- sapply(df_decimal_data, dense_rank)
That worked.
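An end-to-end sketch of that fix (my.data stands in for the question's data; subtracting 1 so the lowest category is 0 is optional, since PCM shifts the responses automatically):
library(eRm)
library(dplyr)
# rank each column into consecutive integers, then shift to start at 0
df_ranked <- as.data.frame(sapply(my.data, dense_rank)) - 1
res <- PCM(df_ranked)
plotPImap(res, sorted = TRUE)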

R cluster analysis and dendrogram with correlation matrix

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix:
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I am not sure how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are good for me?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it's very messy and I don't know how to read it or how to go on.
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a scree plot; I read that it shows a curve from which you can see how many clusters are appropriate.
I also performed a cluster analysis and chose 2-20 clusters, but the results are so long that I have no idea how to handle them or what is important to look at.
Several methods are available to determine the "optimal" number of clusters, although it is a controversial topic.
The kgs function (Kelley-Gardner-Sutcliffe penalty, from the maptree package) is helpful for getting the optimal number of clusters.
Following your code one would do:
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the minimum value of op_k, as you can see in the plot.
You can get it with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
Check this page for more methods.
Hope it helps you.
Edit
To find which is the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Plus
Also see this post for the excellent graphical answer from @Ben.
Edit
op_k[which(op_k == min(op_k))]
still gives the penalty value, not the number of clusters. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package.
Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
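A hedged sketch of that dendextend route, reusing the question's distance object (find_k relies on the fpc package under the hood, if I remember correctly):
library(dendextend)
dend <- as.dendrogram(hclust(distance))
k_opt <- find_k(dend)                    # picks k by average silhouette width
dend_col <- color_branches(dend, k = k_opt$nc)
plot(dend_col)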

How to relate back to original data points in a self organizing map

I am using the R kohonen package for an implementation of a SOM. I am having trouble relating the codebook vectors produced by the self-organizing map back to the original data points. I tried to include labels with zero weight during the training process, but the result was incomprehensible.
Is there a way to refer back to the original data points from each node after the training process is complete?
You can get the centering and scaling values from:
x <- attr(som_model$data, "scaled:center")
y <- attr(som_model$data, "scaled:scale")
To get the original data back, first find the node:
som_model$unit.classif returns the winning node for each of the observations.
Suppose you want to find the data related to the nth node; then:
idx <- som_model$unit.classif == n
z <- matrix(NA, nrow = sum(idx), ncol = ncol(som_model$data))  # container for unscaled values
for (i in 1:ncol(som_model$data)) {
  z[, i] <- som_model$data[idx, i] * y[i] + x[i]
}
Corresponding to the nth node, you will get your original values back in z.
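For context, here is a minimal self-contained sketch of the whole round trip; raw is made-up data, and note that in recent versions of kohonen, som_model$data is a list of matrices, so the $data indexing above may need adjusting:
library(kohonen)
raw <- matrix(rnorm(300), ncol = 3)            # hypothetical data
X <- scale(raw)
som_model <- som(X, grid = somgrid(4, 4, "hexagonal"))
n <- 3                                         # node of interest
idx <- som_model$unit.classif == n
# undo the scaling using the attributes stored on the scaled matrix
orig <- sweep(X[idx, , drop = FALSE], 2, attr(X, "scaled:scale"), "*")
orig <- sweep(orig, 2, attr(X, "scaled:center"), "+")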

Clustering Cell Tower IDs from Known location names

I am new to data mining and I am trying to figure out how to cluster cell tower IDs to find their locations from the known location labels (Home, Work, Elsewhere, No signal).
I have a location-driven dataset for user A that contains cellID (unique ID of detected cell towers), starttime (date & time a particular tower was detected), endtime (last date & time before the phone connected to a different cell tower), and placenames (user-labelled place names such as home or work). There are unlabelled locations in the dataset as well, left empty by the user, and I want to label these cell towers using a clustering approach so that each represents one of the location names.
I am using R, and when I tried to feed the complete dataset to kmeans clustering it failed with the following message, and I don't have a clue why:
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dataset, 4, 15) : NAs introduced by coercion
Any suggestions on how can I use clustering approach for this problem? Thanks
Since you have all the labeled and unlabeled data available at the training stage, what you are looking for is "transductive learning", which is a little different from clustering (which is "unsupervised learning").
For each cell tower you collect the average starttime, endtime and cellID. You can get lat/lng from cellIDs here: https://mozilla-ichnaea.readthedocs.org/en/latest/api/search.html or http://locationapi.org/api (expensive).
This gives you a 4-dimensional feature vector for each tower; the goal is to assign a ternary labeling based on these continuous features:
[avg_starttime avg_endtime lat lng] home/work/other
I don't know about R, but in python basic transductive learning is available:
http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
If you don't get good results with label propagation, and since off-the-shelf transductive learning tools are rare, you might just want to ignore some of your data during training and use more standard methods. If you ignore the unlabeled data at the start you can have an "inductive" or "supervised" learning problem (solve with an SVM). If you ignore the labels at the start you can use unsupervised learning (eg "clustering"; use kmeans or DBSCAN) and then assign labels to the clusters after clustering is done.
I don't know how you got the NaN.
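On the R side, a hedged guess at the NaN: the NAs introduced by coercion warning usually means non-numeric columns (for example cellID or placenames stored as character) were passed to kmeans(), which accepts only numeric data. A sketch, with the feature column names assumed from the answer above:
# cluster on numeric features only, then label each cluster by the
# majority placename among its labelled points (NA if none are labelled)
feats <- dataset[, c("avg_starttime", "avg_endtime", "lat", "lng")]
km <- kmeans(scale(feats), centers = 4, nstart = 15)
lab <- tapply(dataset$placenames, km$cluster, function(v) {
  v <- v[!is.na(v) & v != ""]
  if (length(v)) names(which.max(table(v))) else NA
})
dataset$inferred_place <- lab[km$cluster]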
