I have some clinical data that I ran an LDA analysis on: 98 patients and their corresponding protein levels for a specific type of analysis. I plotted the output and have essentially six different "clusters" of individuals. Here's the problem: some of the clusters have outliers that are much closer to other clusters. We know that in some cases the medical diagnosis on which a cluster is based (the phenotype) might be in error, so an outlier member of the green cluster that appears closer to the blue cluster might actually belong to the blue cluster. My question is: is there a computational way for me to detect these individuals and re-evaluate, based on the data alone and ignoring the medical diagnosis, whether they actually belong to the cluster they are closer to?
Here's a sample of the output (scatter plot not reproduced here).
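For what it's worth, here is a minimal sketch of one computational route, assuming the LDA was fit with MASS::lda and that dat and phenotype are stand-in names for the data frame and the diagnosis column. The idea is to compare each patient's diagnosed cluster with the most probable cluster under the model's posterior probabilities:

library(MASS)

# Fit LDA of the protein levels on the diagnosed phenotype (names are hypothetical)
fit <- lda(phenotype ~ ., data = dat)
post <- predict(fit)$posterior          # one row per patient, one column per cluster

# Flag patients whose diagnosed cluster is not their most probable cluster
most_probable <- colnames(post)[max.col(post)]
suspect <- which(most_probable != dat$phenotype)
dat[suspect, ]                          # candidates for re-evaluation

Patients flagged this way are exactly the ones whose protein profile, ignoring the diagnosis, sits closer to another cluster; whether to actually relabel them is then a clinical judgment, not something the model decides.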
I've recently been studying DBSCAN in R for transit-research purposes, and I'm hoping someone can help me with this particular dataset.
A summary of my dataset is below.
BTIME ATIME
1029 20001 21249
2944 24832 25687
6876 25231 26179
11120 20364 21259
11428 25550 26398
12447 24208 25172
What I am trying to do is cluster these data using BTIME as the x axis and ATIME as the y axis. Each (BTIME, ATIME) pair represents the boarding time and arrival time of a subway passenger.
For more explanation, here is the scatter plot of my total dataset (not reproduced here).
However, if I split my dataset into smaller time periods, the scatter plot looks quite different; I'll call one such period a sample dataset (plot also not shown).
If I perform DBSCAN clustering on the sample dataset, the clustering works as expected.
However, DBSCAN seems unable to find clusters in the total dataset at smaller scales, maybe because the data are too dense.
So my questions are:
Is there a way I can perform clustering on the total dataset?
What criteria should be used to separate the time scale of the data?
I think the total dataset is highly dense, which is why I tried clustering on a sample time period.
If I separate my total data into smaller time scales, how would I choose the hyperparameters for each separated dataset? Looking at the data, the distribution is similar in both the total dataset and the separated sample dataset.
I would sincerely appreciate some advice.
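In case it helps frame the hyperparameter question, here is a minimal sketch using the dbscan package, assuming times is a two-column matrix of BTIME and ATIME; the eps and minPts values are placeholders to be read off the k-nearest-neighbour distance plot, not recommendations:

library(dbscan)

# times: matrix with columns BTIME and ATIME (hypothetical name)
kNNdistplot(times, k = 4)      # look for the "elbow"; that distance is a candidate eps

db <- dbscan(times, eps = 120, minPts = 5)   # illustrative values only
table(db$cluster)              # cluster 0 is noise

On a dense total dataset the elbow of the kNN-distance curve tends to flatten out, which is one symptom of the scale problem described above.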
I have basketball player data that looks like the following:
Player Weight Height Shots School
A NA 70 23 AB
B 130 62 10 AB
C 180 66 NA BC
D 157 65 22 CD
and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources, I found that I can use k-means for the unsupervised part, but I don't know how to handle NAs without losing a good amount of data. I also don't know how to handle the categorical variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?
K-means cannot be used for categorical data. One workaround would be to replace the school with numeric data about it, such as number of enrollments or local SES data.
kmeans() in R cannot handle NAs, so you could either omit them (after checking that the NAs are distributed fairly evenly among the other factors) or look into cluster::clara() from the cluster package.
You have not asked anything specific about the supervised part, so I cannot address that part of the question.
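As a sketch of the clara() route (the players name and the column selection are assumptions based on the example table), note that clara() tolerates NAs in the data matrix:

library(cluster)

# players: data frame as in the example table (hypothetical name)
cc <- clara(players[, c("Weight", "Height", "Shots")], k = 2, correct.d = TRUE)
cc$clustering   # cluster assignment per player, rows with NAs included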
The problem you are facing is known as missing data, and you have to decide how to handle it before starting the clustering. In most cases the samples with missing data (the NAs here) are simply omitted; this happens in the data-preparation and cleaning step of data mining. In R you can do it with the following code:
na.omit(yourdata)
It omits the records (rows) that contain NAs.
But if you want to keep those samples in the clustering, you can replace each missing value with the average of that feature over the rest of the data.
In your case, consider Weight:
for player A you can set his weight to (130 + 180 + 157)/3.
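In R, that overall-mean version of the imputation might look like this (players is a hypothetical name for your data frame):

# Replace missing weights with the mean of the observed weights
players$Weight[is.na(players$Weight)] <- mean(players$Weight, na.rm = TRUE)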
On your other question: it seems you are a little confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define class labels for the samples. You then build a model (a classifier) and train it to learn about each class; after training you can use the model to predict the label of a test sample. For example, you give it a player with the values (Weight = 100, Height = 190, Shots = 55) and it returns the predicted class label.
In unsupervised learning you just cluster the data to find groups of related samples. For this you do not need a class label; you only define the features on which to cluster. For example, you can cluster players based only on their weights, or only on their heights, or use all three of weight, height and shots. This is possible in R using the following code:
clus <- kmeans(na.omit(data$weight), centers = 5)   # 5 clusters based on weight only
clus <- kmeans(na.omit(data[, 2:4]), centers = 5)   # 5 clusters based on weight, height and shots
Note the use of na.omit() here, which removes rows that have NAs in any of the selected columns.
Let me know if this helps.
I have a dataset of the number of stranded turtles reported at a variety of locations along the Queensland coast of Australia. What I would like to find out is the number of stranded turtles that are NOT reported at each of these locations. In order to estimate that number, I have collected data on the frequency with which a turtle is reported to a stranding location, i.e. how often a single turtle stranding is reported more than once, at about 20 points along the coast. So I have count data indicating the number of turtles that are reported to a stranding location one time, two times, or three or more times. Ultimately I would like to relate these data to covariates such as local population density and distance to the nearest road, in order to predict the "zero reporting" incidence for the rest of the coastal areas as well.
My data should look something like this, then:
loc<-c("A","B","C")
rep1<-c(51,24,10)
rep2<-c(4,8,3)
rep3ormore<-c(2,1,0)
pop<-c(50,1000,100)
turtle <- cbind.data.frame(loc, rep1, rep2, rep3ormore, pop)
There are other possible covariates, but I'll keep it simple for now! I think this can be done using a Poisson distribution, but I'm having trouble wrapping my head around how to do it.
Additionally, in certain instances I don't have exact numbers for the turtles that have been reported, but instead I have categories; 4-6, 7-10, >10, etc. If there's a way to model that possibility, that would be great as well!
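Not a full answer, but here is a rough sketch of the zero-truncated-Poisson idea for a single location, estimating lambda by maximum likelihood from the 1/2/3+ report counts and then predicting the unobserved zero class. Treating "three or more" as exactly 3 is a simplifying assumption, and all names are illustrative:

# Report counts for location A: 51 turtles reported once, 4 twice, 2 three times
counts <- c(`1` = 51, `2` = 4, `3` = 2)
k <- as.numeric(names(counts))

# Negative log-likelihood of a zero-truncated Poisson:
# P(K = k | K >= 1) = dpois(k, lambda) / (1 - dpois(0, lambda))
nll <- function(lambda)
  -sum(counts * (dpois(k, lambda, log = TRUE) - log(1 - dpois(0, lambda))))

lambda_hat <- optimize(nll, interval = c(1e-6, 10))$minimum
p_zero <- dpois(0, lambda_hat)    # estimated probability a stranding is never reported
sum(counts) / (1 - p_zero)        # implied total strandings at this location

Relating lambda to covariates such as pop would then be a regression on the same truncated likelihood, and the interval-censored categories (4-6, 7-10, >10) could in principle be handled by summing the Poisson probabilities over each interval inside the likelihood.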
I want to cluster my data (k-means or hclust) in R. My data are ordinal: Likert-scale responses measuring the causes of cost escalation. I have 41 causes ("variables"), each scaled from 1 (no effect) to 5 (major effect), and about 160 observations (respondents who ranked the causes). Any help on how to cluster the 41 causes based on the observations? Do I have to convert the scale to percentages or z-scores before clustering, or is there anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence across observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables that way. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't understand why he converts the data to percentages and then z-score standardizes.
It isn't clear to me whether you want to cluster the rows (the observations) in terms of similarity in the variables, or the variables (the columns) in terms of similarity of occurrence across observations.
Anyway, see the cluster package; it is a recommended package that comes with every R installation.
Read ?daisy for details of what is done with ordinal data. The resulting dissimilarity can be used in functions such as agnes() (hierarchical clustering) or pam() (partitioning around medoids, a more robust relative of k-means).
By default these will cluster the rows (observations). Simply transpose the data object with t() if you want to cluster the columns (variables), although that may well mess up the data depending on how they are stored.
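A minimal sketch of that workflow, assuming lik is a data frame holding the 41 Likert columns (the name and the choice of k are placeholders):

library(cluster)

# Declare the columns as ordered factors so daisy() applies its ordinal handling
lik[] <- lapply(lik, function(x) ordered(x, levels = 1:5))

d <- daisy(lik, metric = "gower")   # dissimilarities between respondents (rows)
fit <- pam(d, k = 4)                # or agnes(d) for hierarchical clustering

# To cluster the 41 variables instead, compute distances between the columns,
# e.g. on the raw numeric scores:
dv <- dist(t(sapply(lik, as.integer)))
hv <- agnes(dv)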
Converting the data to percentages is a form of normalization: it puts all the variables in the range 0 to 1.
If the data are not normalized, you run the risk of the clustering being biased toward dimensions with large values.
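For example, a simple min-max rescaling of every column to the 0-1 range (dat is a placeholder name):

# Rescale each column to [0, 1]
norm01 <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
scaled <- as.data.frame(lapply(dat, norm01))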
I have a dataset with variables representing both scores and traits (a mix of qualitative and quantitative in both cases). I want to cluster the traits (not the individual observations) according to each of the scores. So, I want to form clusters of traits (Trait_1 through Trait_15) that are similar on the basis of Score_1, then repeat for Scores 2 and 3. An example of the data structure is below.
I am thinking that I can use the ClustOfVar package to form these clusters, which I would understand if I were just trying to cluster all of the variables into like groups. However, I don't know how to cluster them on the basis of one of the other variables.
If anyone has suggestions, I'd appreciate it. Thanks in advance.
Score_1 Score_2 Score_3 Trait_1 Trait_2 Trait_3 … Trait_15
n1
n2
n3
…
n100000
You may want to look into subspace clustering algorithms.
They usually allow overlapping clusters, so you may get quite a number of clusters out.
You cluster on the traits only, then check in a second phase whether the clusters you find correspond to your known scores.
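As a rough illustration of that two-phase idea (df is a placeholder name, and plain k-means stands in for a proper subspace method, assuming numeric traits):

# Phase 1: cluster on the trait columns only
traits <- df[, grep("^Trait_", names(df))]
cl <- kmeans(scale(traits), centers = 6)

# Phase 2: check whether the clusters line up with a score
table(cl$cluster, cut(df$Score_1, breaks = 4))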