Pre-defining clusters in R

I have a fairly large data table (about 100,000 observations) that I'd like to cluster. Since some of the data is categorical, I've tried using Gower distance and then hclust() with the "ward" method.
The data itself is very heterogeneous, which is why I'd like to "pre-cluster" the data and then do the actual cluster analysis. Has anyone done this before who can point me in the right direction? I'm at a loss at the moment :(
With the methods mentioned, I don't really get useful clusters.
Thanks, I really appreciate every tip I can get.
Edit: I don't think I explained my problem well, so here's another attempt. Let's say I have a dataset containing car brands and some of their features. Before clustering by features, I would like to pre-cluster by brand, so that all BMWs, for example, are in the same cluster, and so on. Only after that would I like to cluster by features, so that I get a cluster of fast cars, etc.
Does anybody know how to do this in R?
This does not describe my actual dataset, but maybe my question is clearer now.

You should start with a sample first.
Once you get good results on the sample, try to reproduce them on a different sample. Once the results are stable, you can either try to scale the algorithm to the entire data set (maybe try doubling the sample size first), or you can train a classifier and predict the clusters of the remaining data. With most clustering algorithms, a 1-nearest-neighbor classifier will reproduce the cluster assignments very well.
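The sample-then-classify idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is simulated and purely numeric (with categorical columns you would need a Gower-style dissimilarity instead of dist()), and the sample size of 200 and k = 2 are arbitrary choices.

```r
library(class)  # knn(); ships with R as a recommended package

set.seed(42)
# Hypothetical numeric data standing in for the real table: two well-separated groups
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 6), ncol = 2))

# 1. Cluster a small random sample with Ward's method
idx <- sample(nrow(x), 200)
smp <- x[idx, ]
hc <- hclust(dist(smp), method = "ward.D2")
smp_cl <- cutree(hc, k = 2)

# 2. Assign every remaining row to the cluster of its nearest sampled row (1-NN)
rest_cl <- knn(train = smp, test = x[-idx, ], cl = factor(smp_cl), k = 1)

# 3. Combine both parts into one label vector for the full data set
labels <- integer(nrow(x))
labels[idx] <- smp_cl
labels[-idx] <- as.integer(as.character(rest_cl))
table(labels)
```

Repeating steps 1 to 3 with a different seed and comparing the resulting tables is one simple way to check whether the result is stable across samples.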

Related

How to run a cluster on data that is strings only R

I am trying to run a cluster analysis on a very large data set. It contains only strings as values. I have removed the NAs and replaced them with a dummy value. My k-means in R keeps failing due to NA coercion. How would the community cluster this data? I am showing 10 rows of a dummy example below. In this situation let's call the data frame: cluster_data
Any help would be greatly appreciated. I am trying to see if any of the columns cause the data to break apart earlier than another, to try to understand a possible structure. I thought clustering with k-means was the best approach, but I do not see how to do it with strings. I have converted the columns to factors in R and still have issues. Any example code is greatly appreciated.
Question: how do you run kmeans clustering with strings?
Answer: You can't run k means cluster analysis on categorical data. You need data that a distance function can make sense of.
K-means is designed for continuous variables, where least-squares and the mean make sense to be used as centers.
For other data types, it is better to use other algorithms, such as PAM, HAC, DBSCAN, OPTICS, ...
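As a rough sketch of the PAM route, assuming the strings are nominal categories: the cluster_data frame below is invented for illustration (the real one is not shown), and k = 2 is an arbitrary choice. Gower dissimilarity from cluster::daisy() is one option that can handle factor columns; it is not the only one.

```r
library(cluster)  # daisy() and pam(); a recommended package shipped with R

# Hypothetical stand-in for cluster_data: every column is a string
cluster_data <- data.frame(
  colour = c("red", "red", "blue", "blue", "red", "blue"),
  size   = c("S", "S", "L", "L", "S", "L"),
  shape  = c("round", "round", "square", "square", "round", "round"),
  stringsAsFactors = TRUE   # daisy() needs factors, not character vectors
)

# Gower dissimilarity: for nominal columns, the share of attributes that differ
d <- daisy(cluster_data, metric = "gower")

# PAM (k-medoids) works directly on the dissimilarity object, no means required
fit <- pam(d, k = 2, diss = TRUE)
fit$clustering
```

Because PAM only needs pairwise dissimilarities, it avoids the numeric coercion that makes kmeans() fail on strings. Note that the dissimilarity matrix grows quadratically with the number of rows, so on a very large data set this would have to be run on a sample.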

Which cluster methodology should I use for a multidimensional dataset?

I am trying to create clusters of countries with a dataset quite heterogeneous (the data I have on countries goes from median age to disposable income, including education levels).
How should I approach this problem?
I read some interesting papers on clustering, using K-means for instance, but it seems those algorithms are mostly used when there are two sets of variables, not 30 like in my case, and when the variables are comparable (it might be tough to cluster countries with such diversity in the data).
Should I normalise some of the data? Should I just focus on fewer indicators to avoid this multidimensional issue? Use spectral clustering first?
Thanks a lot for the support!
Create a "similarity metric". Probably just a weight on each of your measurements, but you might build in some corrections for population size and so on. You can only have a few hundred countries at most, so most brute-force methods will work. Hierarchical clustering would be my first port of call, and that will tell you if the data is inherently clustered.
If all the data is quantitative, you can normalise on 0 - 1 (lowest country is 0, highest is 1), then take eigenvectors. Then plot out the first two axes in eigenspace. That will give another visual fix on clusters.
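A minimal sketch of the normalise-then-eigenvectors step, assuming purely quantitative indicators. The three columns and their values are made up for illustration; prcomp() is used here as one standard way to get the eigenvector projection, and the dendrogram is added to show the hierarchical view on the same scaled data.

```r
set.seed(1)
# Hypothetical indicators for 40 countries: rows are countries, columns are measurements
countries <- data.frame(
  median_age = runif(40, 15, 48),
  income     = rlnorm(40, meanlog = 9, sdlog = 1),
  schooling  = runif(40, 2, 14)
)

# Min-max scaling: the lowest country maps to 0 and the highest to 1 in every column
range01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(countries, range01))

# Principal components are the eigenvectors of the covariance matrix of the scaled data
pca <- prcomp(scaled, center = TRUE, scale. = FALSE)
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2", main = "Countries in eigenspace")

# Hierarchical clustering on the same scaled data; the dendrogram shows any grouping
hc <- hclust(dist(scaled), method = "average")
plot(hc, labels = FALSE, main = "Dendrogram of countries")
```

With random data like this, neither plot should show distinct groups, which is what "not clustered" looks like in practice.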
If it's not clustered, however, it's better to admit that.

Cluster your time-series data

I have time-series data of 12 consumers. The data corresponding to 12 consumers (named as a ... l) is
I want to cluster these consumers so that I know which of them have the most similar consumption behavior. To that end, I found the clustering method pamk, which automatically estimates the number of clusters in the input data.
I assume that I have only two options to calculate the distance between any two time-series, i.e., Euclidean, and DTW. I tried both of them and I do get different clusters. Now the question is which one should I rely upon? and why?
When I use Euclidean distance, I get the following clusters:
and using DTW distance I get
Conclusion:
How will you decide which clustering approach is the best in this case?
Note: I have asked the same question on Cross-Validated also.
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern?
The clustering visualizations indicate that there are no clusters, too. b and l appear to be the most unusual outliers, followed by d, e, h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
Either way, there may be no clusters. You need to be prepared for this outcome, and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means to decide which result is more "correct" than the other (most likely, neither is correct, and you should rely on neither, to answer your question).
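To illustrate the hierarchical clustering suggestion, here is a small sketch on simulated data with no built-in grouping (the 12 x 48 matrix is an assumption, not the consumer data from the question). The average silhouette width is added as one possible sanity check; a common rule of thumb reads values below roughly 0.25 as "no substantial structure", but that is a heuristic, not a proof either way.

```r
library(cluster)  # silhouette()

set.seed(7)
# Hypothetical stand-in: 12 consumers (a to l), 48 readings each, no built-in grouping
series <- matrix(rnorm(12 * 48), nrow = 12, dimnames = list(letters[1:12], NULL))

d <- dist(series)                       # Euclidean distance between consumers
hc <- hclust(d, method = "average")
plot(hc, main = "Consumers, Euclidean distance")

# Average silhouette width for several candidate numbers of clusters
avg_sil <- sapply(2:6, function(k) {
  mean(silhouette(cutree(hc, k), d)[, "sil_width"])
})
names(avg_sil) <- 2:6
round(avg_sil, 2)
```

A dendrogram in which every merge happens at a similar height, together with uniformly low silhouette widths, is consistent with the "no clusters" hypothesis.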

k-means clustering with new training data?

I'm working on some image recognition stuff and am trying to use k-means for matching algorithms.
Specifically, I have lots of vectors (SURF descriptors, to be exact) in a database and I would like to cluster them for future matching.
However, the problem is that I expect the training dataset to grow (new training data may arrive), which makes it impossible for me to train on all the data in one run.
It would be OK to cluster some data first, but does that mean that every new batch of data needs a full re-clustering? If I'm confident enough in the existing clusters, does a small amount of extra data (e.g. 1% on top of all existing data) hurt the clusters?
K-means is not a particularly smart algorithm. And on SIFT vectors, the results are often not much better than random convex partitions anyway.
If your initial sample was representative, there should be no need to rerun the clustering: the new data should have little effect on the centroids anyway.
To speed up the clustering, you can also re-use the previous centroids as initial seeds. This should then require far fewer iterations.
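Both options can be sketched in R roughly as follows. The data is simulated, the 64 columns and k = 8 are assumptions for illustration, and nearest() is a hypothetical helper written for this example, not a library function.

```r
set.seed(3)
k <- 8
# Hypothetical descriptors: 64-dimensional vectors scattered around k true centres
true_centers <- matrix(rnorm(k * 64, sd = 5), ncol = 64)
make_data <- function(n) true_centers[sample(k, n, replace = TRUE), ] + rnorm(n * 64)
old_data <- make_data(2000)
new_data <- make_data(20)        # about 1% extra data

fit_old <- kmeans(old_data, centers = k, nstart = 5, iter.max = 50)

# Option 1: keep the old centroids and only assign each new vector to the nearest one
nearest <- function(x, centers) {
  apply(x, 1, function(row) which.min(colSums((t(centers) - row)^2)))
}
new_labels <- nearest(new_data, fit_old$centers)

# Option 2: re-cluster everything, seeded with the old centroids instead of random ones
fit_new <- kmeans(rbind(old_data, new_data), centers = fit_old$centers, iter.max = 50)
fit_new$iter   # number of iterations the warm start needed
```

Passing a matrix to the centers argument of kmeans() makes it start from those rows instead of drawing random seeds, which is what makes the second option a warm start.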

Clustering time series in R

I have a problem with clustering time series in R.
I googled a lot and found nothing that fits my problem.
I have made an STL decomposition of my time series.
The trend component is in a matrix with 64 columns, one for every series.
Now I want to cluster these series into similar groups, taking into account both the curve shapes and the time shift. I found some functions that cover one of these aspects, but not both.
First I tried to calculate a distance matrix with the DTW distance, so I found clusters that are based on the values and account for the time shift, but not for the shape of the time series. After this I tried some correlation-based clustering, but then the time shift was not recognized and the result does not meet my requirements.
Is there a function that covers my problem, or do I have to build something on my own? I'm thankful for any kind of help; after two days of tutorials and examples I'm totally out of ideas. I hope I have explained the problem well enough.
I attached a picture showing some example time series.
There you can see the problem: the two series in the middle are assigned to one cluster, although the upper one and the one at the bottom have the same shape as one of the middle ones.
Have you tried the R package dtwclust?
https://cran.r-project.org/web/packages/dtwclust/index.html
(I'm just starting to explore this package, but it seems like it covers a lot of aspects of time series clustering and it has lots of good references.)
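If you end up building something on your own, one possible starting point is a cross-correlation distance, which compares shapes while tolerating a time shift. The sketch below uses only base R and is not how dtwclust does it; shift_dist() and its max_lag parameter are hypothetical names for this example, and the four simulated series stand in for the 64 trend columns.

```r
# Distance that tolerates time shifts: 1 minus the highest cross-correlation in a lag window
shift_dist <- function(x, y, max_lag = 10) {
  cc <- ccf(x, y, lag.max = max_lag, plot = FALSE)$acf
  1 - max(cc)
}

set.seed(11)
t_idx <- 1:100
# Hypothetical trend matrix, one series per column as in the question (4 instead of 64)
trend <- cbind(
  s1 = sin(t_idx / 8),
  s2 = sin((t_idx - 6) / 8),      # same shape as s1, shifted in time
  s3 = t_idx / 50,                # steady upward trend
  s4 = (t_idx + 5) / 50           # same trend, shifted
) + matrix(rnorm(400, sd = 0.05), ncol = 4)

# Pairwise distance matrix between the columns
n <- ncol(trend)
d <- matrix(0, n, n, dimnames = list(colnames(trend), colnames(trend)))
for (i in seq_len(n - 1)) for (j in (i + 1):n) {
  d[i, j] <- d[j, i] <- shift_dist(trend[, i], trend[, j])
}

hc <- hclust(as.dist(d), method = "average")
cutree(hc, k = 2)
```

Since ccf() centres and scales the series, the distance depends on shape rather than on level, and max_lag bounds how much shift is tolerated. Whether that is the right trade-off depends on the data.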
You can use the kml package. It is designed specifically for longitudinal data. See its help pages, which include the following example:
### Load the package
library(kml)
### Generation of some data
cld1 <- generateArtificialLongData(25)
### We suspect 3, 4 or 6 clusters, we want 3 redrawing.
### We want to "see" what happen (so printCal and printTraj are TRUE)
kml(cld1,c(3,4,6),3,toPlot='both')
### 4 seems to be the best. We want 7 more redrawing.
### We don't want to see again, we want to get the result as fast as possible.
kml(cld1,4,10)

Resources