how to rank the characteristic of cluster in r - r

This is the result of cluster analysis through the k-means function.
>weseg2<-read.csv("WE_SEG DATA.csv",header=TRUE)
>training.data2<-scale(weseg2)
>aaaa<-kmeans(training.data2, centers=4, iter.max=10000, nstart=20)
I want to know what characteristics each cluster has.
So I got the average of each variable by clusters.
this is my code that calculate the mean of variables.
mean of first cluster
>rank1<-colMeans(training.data2[aaaa$cluster==1,])
mean of second cluster
>rank2<-colMeans(training.data2[aaaa$cluster==2,])
mean of third cluster
>rank3<-colMeans(training.data2[aaaa$cluster==3,])
mean of fourth cluster
>rank4<-colMeans(training.data2[aaaa$cluster==4,])
If so,what code should be entered so taht it can rank the clusters for each variable?
For example, if you have the variables a,b and c, you have rank of four clusters in the a variable, and rank of four clusters in the b and c variables.

Use the apply and rank function.
like this
>rank5<-cbind(rank1,rank2,rank3,rank4)
>apply(rank5,1,rank)
Then, you can get a rank of columns.
And if you want to get rank as decrease, use the order() function.
good luck.

Related

Apply k-means to examine differences between two groups in R

I have two groups. The treatment group is exposure to media; the control group is no media. They are distinguished by a categorial variable in the data frame. (exposure to media = 1, no media = 0)
Now, I want to examine whether there are any clear differences between these two groups. To do this, apply the k-means algorithm with two clusters to four variables (proportion of black population, proportion of male population, proportion of hispanic population, median income on the logarithmic scale).
How to do this in R? Could anyone give some hints? Thanks!
Try this:
km <-kmeans(your data, 2, nstart=10)
your data here as a data.frame (your whole data or you can select the variables that you are interesting about them). You need to select the number of clusters (here is 2). A good practice to understand your data is to apply different number of cluster and then see which one fit your data better (use for example any criteria methods such as AIC or BIC).
k-means is an approach applied to cluster data. Where this data come from different distribution and we would like to know from where each observation come from (from which distribution).
You can also have a look at many tutorials about kmeans in R. For example,
https://onlinecourses.science.psu.edu/stat857/node/125
https://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
http://www.statmethods.net/advstats/cluster.html

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations so i have used the daisy package to calculate the gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package for the aggregated dataset however i want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is:, NULL or a vector with length size of d. See the ‘Details’ section. When you look at the details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For ward etc. you need to correctly take the weights into account yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For ward, you should get a SSQ like formula.

R: Putting Variables in order by a different variable

Once again I have been set another programming task and to most of which I have done, so a quick run through: I've had to take n amount of samples of multivariate normal distribution with dimension p (called it X) then to put it into a matrix (Matx) where the first two values in each row were taken and summed a long with a value randomly drawn from the standard normal distribution. (Call this vector Y) Then we had to order Y numerically and split it up into H groups, and then I had to find out the mean of each row in the matrix and now having to order then in terms of which Y group they were associated. I've struggled a fair bit and have now hit a brick wall. Quite confusing I understand, if anyone could help it'd be greatly appreciated!
Task:Return the pxH matrix which has in the first column the mean of the observations in the first group and in the Hth column the mean in the observations in the Hth group.
Code:
library('MASS')
x<-mvrnorm(36,0,1)
Matx<-matrix(c(x), ncol=6, byrow=TRUE)
v<-rnorm(6)
y1<-sum(x[1:2],v[1])
y2<-sum(x[7:8],v[2])
y3<-sum(x[12:13],v[3])
y4<-sum(x[19:20],v[4])
y5<-sum(x[25:26],v[5])
y6<-sum(x[31:32],v[6])
y<-c(y1,y2,y3,y4,y5,y6)
out<-order(y)
h1<-c(out[1:2])
h2<-c(out[3:4])
h3<-c(out[5:6])
x1<-c(x[1:6])
x2<-c(x[7:12])
x3<-c(x[13:18])
x4<-c(x[19:24])
x5<-c(x[25:30])
x6<-c(x[31:36])
mx1<-mean(x1)
mx2<-mean(x2)
mx3<-mean(x3)
mx4<-mean(x4)
mx5<-mean(x5)
mx6<-mean(x6)
d<-c(mx1,mx2,mx3,mx4,mx5,mx6)[order(out)]
d

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix which dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. So in other words, in each row the first column is n and the rest of the k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row there are many columns with value of 0.
Here's a toy matrix I created:
mat=rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),c(22,0.4,0.6,0,0,0,0,0,0,0,0),c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
Where say samples=100000
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive at least in terms of RAM. Is whether a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
draws = apply(mat,1,function(x) var(rmultinom(samples,x[1],x[2:ncol(mat)])))

Cluster Variables Against Single Outcome Variable - ClustOfVar

I have a dataset with variables representing both scores and traits (mix of qualitative and quantitative on both counts). I want to cluster the traits (not the individual observations) according to each of the scores. So, I want to form clusters of traits (trait_1 through trait_15) that are similar on the basis of score_1, then repeat for scores 2 and 3. Example of the data structure below.
I am thinking that I can use the ClustOfVar package to form these clusters, which I would understand if I were just trying to cluster all of the variables into like groups. However, I don't know how to cluster them on the basis of one of the other variables.
If anyone has suggestions, I'd appreciate it. Thanks in advance.
Score_1 Score_2 Score_3 Trait_1 Trait_2 Trait_3 … Trait_15
n1
n2
n3
…
n100000
You may want to look into subspace clustering algorithms.
They usually allow overlapping clusters, so you may get out quite a number of clusters.
You cluster on the traits only, then check if the found clusters correspond to your known scores in a second phase.

Resources