I'm trying to cluster the movies data set that comes with the package "ggplot2" in R. I will be using k-means. The column names that come with this data set are:
[1] "title" "year" "length" "budget" "rating"
[6] "votes" "r1" "r2" "r3" "r4"
[11] "r5" "r6" "r7" "r8" "r9"
[16] "r10" "mpaa" "Action" "Animation" "Comedy"
[21] "Drama" "Documentary" "Romance" "Short"
Do you think it is a good idea to do clustering based on the movie genre? I'm kind of lost and don't know where to start. Any advice?
You need to figure out what makes a good cluster.
There are millions of ways to cluster this data set, because you can preprocess the data differently, use different algorithms, distances, and so on.
Without your guidance, the clustering algorithm will just do something, and likely return a completely useless result!
So you need to first get a clear objective: what is a good clustering?
Then you can try to adapt the data so that the clustering algorithm optimizes for this objective. For k-means, you need to do all of this in preprocessing. For hclust, you can also choose a distance function that matches your objective.
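For k-means in R, that preprocessing might look like the following minimal sketch (assuming you cluster on a few of the numeric columns; note that in newer versions the movies data lives in the ggplot2movies package rather than ggplot2, and the choice of columns and k here is arbitrary):

    # Select numeric columns and scale them so that no single
    # variable dominates the Euclidean distance.
    num <- scale(movies[, c("year", "length", "votes", "rating")])

    km <- kmeans(num, centers = 5, nstart = 25)

    # With hclust you can additionally pick a distance that matches your objective:
    hc <- hclust(dist(num, method = "manhattan"), method = "average")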
To answer your first question: Yes, I think that this is an interesting project. Working with this dataset could be a cool way to learn about different data mining techniques.
To answer your second question, here is some advice. Clustering is an unsupervised learning technique: learning is unsupervised when the target variable (in this case, the genre of the movie) is unknown. However, looking at the columns you listed, it seems you do have the genre information. With that in mind, you have two options. First, you could pretend you don't have the genre information and apply k-means to the rest of the data; after the clustering is done, you could evaluate how well the algorithm did by comparing its clusters to the known genres. Second, you could treat this as a classification problem and use the genre information to learn a model that predicts the genre. You might already know this, but I just wanted to say it.
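For the first option, a rough sketch (assuming the movies data frame is loaded, and using the Drama flag as the "known" label; the column choice and k are just examples):

    # Cluster on non-genre columns, then cross-tabulate against a genre flag.
    feats <- scale(movies[, c("year", "length", "votes", "rating")])
    km    <- kmeans(feats, centers = 2, nstart = 25)
    table(cluster = km$cluster, drama = movies$Drama)

A strong diagonal (or anti-diagonal) in that table would suggest the clusters recover the genre; a flat table would suggest they do not.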
To give you some advice on the clustering problem specifically, I would first want to know what the 'r1', ..., 'r10' variables represent. Are they numeric or categorical? K-means has two steps: one where you assign each data point to the nearest centroid, and one where you calculate the new centroids by taking the mean of all data points in a cluster. Does taking the mean of these variables make sense?
With that in mind, I would recommend first choosing the variables that you want to use in the clustering algorithm. Then write the following functions: one that can calculate the distance between two points, one that can assign an observation to the nearest centroid, and one that can recalculate the centroids based on the assignments.
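For example, a minimal sketch of those three functions, assuming a numeric matrix x and Euclidean distance (illustrative only; it does not handle empty clusters):

    # Distance between two points.
    dist_to <- function(p, q) sqrt(sum((p - q)^2))

    # Assign each observation (row of x) to its nearest centroid.
    assign_points <- function(x, centroids) {
      apply(x, 1, function(p) which.min(apply(centroids, 1, dist_to, q = p)))
    }

    # Recalculate each centroid as the mean of its assigned points.
    update_centroids <- function(x, assignment, k) {
      t(sapply(1:k, function(i) colMeans(x[assignment == i, , drop = FALSE])))
    }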
I have a data frame S = [rows x cols] containing samples, where the row names give the names of the samples and the column names give their features.
In this data frame, I have inserted one test sample t1. Now I want to extract the name and values of the particular sample s that is most similar to t1, i.e. its twin.
For this purpose I have used heatmaply(), which plots a hierarchical clustering. Observing this plot, I can see a cluster formed at stage 1 (of the iterative hierarchical clustering process) which contains only two members: my test sample t1 and its almost-twin, a nearly identical sample.
Now I want to extract only that twin sample s from the stage-1 cluster, and nothing else. Please guide me in this regard.
I know a little about the hclust and dist functions. The problem with dist is that it provides too much information, and I can't think of any way to extract the twin of my test sample t1 from the dist matrix. I also know a little about cutree(). To my limited knowledge, it gives the clusters to which the members belong, depending on the value of the argument k; when the value of k changes, the members of the clusters change. I am wondering if I can exploit cutree to get the stage-1 clusters (which contain two members) and find the member that is similar to my test sample t1.
The components of the hclust object, particularly merge and order, also interest me. Maybe someone can guide me on how to use them to get the twin.
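To make the question concrete, this is roughly the kind of lookup I am after (assuming S holds the samples in rows and "t1" is the row name of my test sample); I am not sure whether this is the right approach:

    d    <- as.matrix(dist(S))        # full pairwise distance matrix
    d_t1 <- d["t1", ]                 # distances from t1 to every sample
    d_t1["t1"] <- Inf                 # ignore the zero distance to itself
    twin <- names(which.min(d_t1))    # name of the most similar sample
    S[twin, ]                         # and its values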
I am sorry for the long post. I tried to explain as clearly and concisely as possible, and to show what I have attempted; your experience in solving this problem is highly appreciated.
I have time-series data for 12 consumers, named a ... l.
I want to cluster these consumers so that I know which of them have the most similar consumption behavior. Accordingly, I found the clustering function pamk, which automatically selects the number of clusters for the input data.
I assume that I have only two options for calculating the distance between any two time series, i.e., Euclidean and DTW. I tried both of them and I get different clusters. Now the question is: which one should I rely upon, and why?
When I use Euclidean distance I get the following clusters:
and using DTW distance I get:
How would you decide which clustering approach is best in this case?
Note: I have asked the same question on Cross Validated as well.
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern.
The clustering visualizations also indicate that there are no clusters. b and l appear to be the most unusual outliers, followed by d, e, and h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
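A minimal sketch (assuming the 12 series are rows of a numeric matrix series_mat with rownames a ... l; loading the dtw package registers a "DTW" method with the proxy package):

    library(dtw)
    d_dtw <- as.dist(proxy::dist(series_mat, method = "DTW"))
    hc <- hclust(d_dtw, method = "average")
    plot(hc)   # inspect the dendrogram for any real structure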
But either way, there may be no clusters. You need to be prepared for this outcome and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have no means to decide which result is more "correct" than the other (most likely neither is correct, and you should rely on neither, to answer your question).
I want to do a project on document summarization.
Can anyone please explain the algorithm for document summarization using graph based approach?
Also, can someone provide links to a few good research papers?
Take a look at TextRank and LexRank.
LexRank is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.
In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.
https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank
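A toy sketch of the graph-based ranking idea in R (not the exact TextRank/LexRank formulation; sim is assumed to be an n x n sentence-similarity matrix you have already built, e.g. from cosine similarity of term vectors):

    # Power-iteration PageRank over a weighted sentence graph.
    pagerank <- function(sim, d = 0.85, iter = 50) {
      diag(sim) <- 0
      W <- sim / pmax(rowSums(sim), 1e-12)   # row-normalise edge weights
      r <- rep(1 / nrow(sim), nrow(sim))
      for (i in seq_len(iter))
        r <- (1 - d) / nrow(sim) + d * as.vector(t(W) %*% r)
      r
    }

    # A 3-sentence summary: order(pagerank(sim), decreasing = TRUE)[1:3]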
I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to report the strength of each cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and to use the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me modify the existing code to accept my distance measurements?
3) Or is there a better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the data set as there are possible nominal values and put 1/0 where needed. For example, if a nominal variable takes the 3 values "reptile", "mammal" and "bird", you could change your initial data set of 2 columns (numeric, nominal) into a new one with 4 columns (numeric, numeric representing reptile, numeric representing mammal, numeric representing bird); an instance (23.4, "mammal") would be mapped to (23.4, 0, 1, 0).
Using this mapping you can work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
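In R, that 1/0 mapping can be done with base model.matrix; a quick sketch on made-up data:

    df <- data.frame(numeric_val = c(23.4, 10.1, 5.7),
                     class = factor(c("mammal", "bird", "reptile")))
    X <- model.matrix(~ numeric_val + class - 1, data = df)  # drop intercept
    X <- scale(X)   # standardise so no column dominates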
About 2)
daisy returns an object of class dissimilarity, which you can use directly in other clustering algorithms from the cluster package (so you may not have to implement anything extra). For example, the function pam can take the object returned by daisy directly.
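For example (a minimal sketch, with traits standing in for your mixed numeric/factor data frame, and k = 4 an arbitrary choice):

    library(cluster)
    d   <- daisy(traits, metric = "gower")
    fit <- pam(d, k = 4)    # pam accepts the dissimilarity directly
    fit$medoids             # the representative species per cluster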
About 3)
Clusters are really subjective, and most clustering algorithms depend on the initial conditions, so "significant clusters" is a term some people would not be comfortable using. pam could be useful in your case because its clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?); pam builds clusters centered on actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree over the data and then removes the longest (inconsistent) edges, so that the remaining connected components form the clusters.
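A hedged sketch of that idea using the igraph package (Zahn's original criterion for "inconsistent" edges is more refined; here the k - 1 heaviest MST edges are simply cut to get k clusters):

    library(igraph)
    k <- 3
    d <- as.matrix(dist(scale(iris[, 1:4])))   # example data
    g <- graph_from_adjacency_matrix(d, mode = "undirected", weighted = TRUE)
    tree <- mst(g)                             # minimum spanning tree
    tree <- delete_edges(tree, order(E(tree)$weight, decreasing = TRUE)[1:(k - 1)])
    clusters <- components(tree)$membership    # one label per observation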
Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. Particularly the part about unsupervised machine learning got my attention. Unfortunately, it stops where it might get even more interesting.
Basically, I am looking to classify discrete matrices with an unsupervised algorithm. Those matrices just contain discrete values of the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1 to 3. I just started to read through the literature, and I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there.
I also looked at the Machine Learning and Cluster Cran Task Views but do not know where to start with a practical example.
So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?
EDIT:
I realized that I might have been too imprecise: my matrices contain discrete choice data, so mean-based clustering might(!) not be the right idea. I do understand what you said about vectors and observations, but I am hoping for some function that accepts matrices or data frames, because I have several observations over time.
EDIT2:
I realize that a package / function, introduction that focuses on unsupervised classification of categorical data is what would help me the most right now.
... classify discrete matrices by an unsupervised algorithm
You must mean cluster them. Classification is commonly done by supervised algorithms.
I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there
Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20x15 matrices to produce length-300 vectors; each element of such a vector would then be a feature (or variable) to base a clustering on. This is the way most ML packages, including the cluster package you link to, work: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
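A rough sketch of that flattening (mats is assumed to be your list of 20x15 matrices; and since the cells only take the values 1-3, treating them as categories with Gower dissimilarity plus PAM may be safer than k-means on the raw codes):

    X <- t(sapply(mats, as.vector))   # one length-300 row vector per matrix

    library(cluster)
    Xf  <- as.data.frame(lapply(as.data.frame(X), factor))
    fit <- pam(daisy(Xf), k = 5)      # k = 5 is an arbitrary choice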
So far I have found daisy from the cluster package, specifically its "gower" argument, which refers to Gower's similarity coefficient for handling multiple types of data. Gower seems to be a fairly old distance metric; still, it's what I found for use with categorical data.
You might want to start here: http://cran.r-project.org/web/views/MachineLearning.html