I am new to R and the clustering world. I am using a shopping dataset and extracting features from it in order to identify something meaningful.
So far I have managed to learn how to merge files, remove NA values, compute the sum of squared errors, work out the mean values, summarise by group, run the k-means clustering and plot the X, Y results.
However, I am very confused about how to view these results or identify what would be a useful cluster. Am I repeating something or missing something? I also get confused about which X and Y variables to plot.
Below is my code; maybe my code is wrong. Could you please help? Any help would be great.
# Read file
mydata <- read.csv(file.choose(), header = TRUE)
#view the file
View(mydata)
#create new data set
mydata.features = mydata
mydata.features <- na.omit(mydata.features)
# elbow method: within-groups sum of squares for k = 1..20
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
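# note (beyond the original post): kmeans() uses random starts, so the elbow
# curve varies between runs; calling set.seed() first and using e.g.
# kmeans(mydata.features, centers=i, nstart=25) makes the curve stable
# and less likely to get stuck in a poor local optimum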
# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)
# cluster on the original features only (exclude the appended fit.cluster column)
results <- kmeans(mydata.features[, names(mydata.features) != "fit.cluster"], 3)
# plot on the NA-filtered data so the rows line up with the cluster vector
plot(mydata.features[c("DAY","WEEK_NO")], col=results$cluster)
Sample data variables. Below are all the variables I have within my dataset; it's a shopping dataset collected over 2 years:
PRODUCT_ID - uniquely identifies each product
household_key - uniquely identifies each household
BASKET_ID - uniquely identifies a purchase occasion
DAY - day when the transaction occurred
QUANTITY - number of products purchased during the trip
SALES_VALUE - amount of dollar retailers receive from sales
STORE_ID - identifies unique stores
RETAIL_DISC - discount applied due to a manufacturer coupon
TRANS_TIME - time of day when the transaction occurred
WEEK_NO - week in which the transaction occurred (1-102)
MANUFACTURER - code that links products from the same manufacturer together
DEPARTMENT - groups similar products together
BRAND - indicates private or national label brand
COMMODITY_DESC - groups similar products together at the lower level
SUB_COMMODITY_DESC - groups similar products together at the lowest level
Sample Data
I put together some sample data, so I can help you better:
#generate sample data
sampledata <- matrix(data=rnorm(200,0,1),50,4)
#add ID to data
sampledata <-cbind(sampledata, 1:50)
#show data:
head(sampledata)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.72859559 -2.2864943 -0.5408501 0.1564730 1
[2,] 0.34852943 0.3100891 0.6007349 -0.5985266 2
[3,] -0.04605026 0.5067896 -0.2911211 -1.1617171 3
[4,] -1.88358617 1.3739440 -0.5655383 0.9518367 4
[5,] 0.35528650 -1.7482304 -0.3871520 -0.7837712 5
[6,] 0.38057682 0.1465488 -0.6006462 1.3827544 6
I have a matrix with data points. Each data point has 4 variables (column 1 - 4) and an id (column 5).
Apply K-means
After that I apply the k-means function (but only to columns 1:4, since it doesn't make much sense to cluster the id):
#kmeans (4 centers)
result <- kmeans(sampledata[,1:4], 4)
Analyse output
If I want to see which data point belongs to which cluster, I can type:
result$cluster
The result will be, for example:
[1] 4 3 2 2 1 2 4 4 3 3 3 3 2 1 4 4 4 2 4 4 4 1 1 1 3 3 3 3 1 3 2 2 4 4 2 4 2 3 1 2 2 2 1 2 1 1 4 1 1 1
This means that data point 1 belongs to cluster 4. The second data point belongs to cluster 3, and so on...
If I want to retrieve all data points that are in cluster 1, I can do the following:
sampledata[result$cluster==1,]
This will output a matrix, with all the values and the Data Point Id in the last Column:
[,1] [,2] [,3] [,4] [,5]
[1,] 0.3552865 -1.748230422 -0.3871520 -0.78377121 5
[2,] 0.5806156 0.479576142 1.1314052 1.60730796 14
[3,] 1.1871472 1.280881477 -1.7227361 -0.89045074 22
[4,] 0.8482060 0.726470349 0.6851352 -0.78526581 23
[5,] -0.5324139 -1.745802580 0.6779943 0.99915708 24
[6,] 0.2472263 -0.006298136 -0.1457003 -0.44789364 29
[7,] 0.1412812 -0.247076976 0.9181507 -0.58570904 39
[8,] 0.1859786 -1.768692166 0.5681229 -0.80618157 43
[9,] -1.1577178 -0.179886998 1.5183880 0.40014071 45
[10,] 1.0667566 -1.602875994 0.6010581 -0.49514049 46
[11,] 0.2464646 1.226129859 -1.3628096 -0.37666716 48
[12,] 1.2660358 0.282688323 0.7650636 0.23442255 49
[13,] -0.2499337 0.855327072 0.2290221 0.03492807 50
If I want to know how many data points are in cluster 1, I can type:
sum(result$cluster==1)
This will return 13, and corresponds to the number of lines in the matrix above.
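As a small aside (not in the original answer), the sizes of all clusters at once come from either of these:
# cluster sizes, all at once
table(result$cluster)
# kmeans also stores them directly
result$size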
Finally some plotting:
First, let's plot the data. Since you have a multidimensional data set and you can only plot two dimensions in a standard plot, you have to do it like this: select the variables you want to plot, for example variables 2 and 3 (columns 2 and 3). This corresponds to:
sampledata[,2:3]
To plot this data, simply write:
plot(sampledata[,2:3], col=result$cluster, main="Affiliation of observations")
Use the argument col (this stands for color) to give the data points a color according to their cluster affiliation, by typing col=result$cluster.
If you also want to see the cluster centers in the plot, add the following line:
points(result$centers, col=1:4, pch="x", cex=3)
The plot should now look like this (for variable 2 vs variable 3; plot image not shown): the dots are the data points, the X's are the cluster centers.
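If you would rather see all pairwise combinations of the four variables at once instead of picking two columns by hand, a scatterplot matrix works the same way (a suggestion beyond the original answer):
# all pairwise variable plots, colored by cluster affiliation
pairs(sampledata[,1:4], col=result$cluster)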
I am not really familiar with the k-means function, and it's hard to help without any sample data. Here, however, is something that might help:
kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
iter: The number of (outer) iterations.
ifault: integer: indicator of a possible algorithm problem – for experts.
See ?kmeans for more.
You can access these components with the $ operator. For example, if you want to have a look at the clusters:
results$cluster
Or have more details about the centers:
results$centers
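These components also give a rough handle on the original question of what makes a useful clustering. For example, the share of the total sum of squares that falls between clusters (closer to 1 means more compact, better-separated clusters), assuming results holds a kmeans fit:
# proportion of total variance captured by the clustering
results$betweenss / results$totss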
Related
I'm trying to reduce the input data size by first performing K-means clustering in R, and then sampling 50-100 samples per representative cluster for downstream classification and feature selection.
The original dataset was split 80/20, and the 80% went into K-means training. I know the input data has 2 columns of labels and 110 columns of numeric variables. From the label column, I know there are 7 different drug treatments. In parallel, I tested the elbow method to find the optimal K for the cluster number; it is around 8. So I picked 10, to have more data clusters to sample from downstream.
Now that I have finished running model <- kmeans(), the output list got me a little confused about what to do. Since I had to scale only the numeric variables to put into the kmeans function, the output cluster memberships don't have the treatment labels anymore. This I can overcome by appending the cluster membership to the original training data table.
Then for the 10 centroids, how do I find out what the labels are? I can't just do
training_set$centroids <- model$centroids
And the most important question: how do I find the 100 samples per cluster that are the closest to their respective centroid? I have seen one post on this in Python, but no R resources yet:
Output 50 samples closest to each cluster center using scikit-learn.k-means library
Any pointers?
First we need a reproducible example of your data:
set.seed(42)
x <- matrix(runif(150), 50, 3)
kmeans.x <- kmeans(x, 10)
Now you want to find the observations in the original data x that are closest to the centroids computed and stored in kmeans.x. We use the get.knnx() function in package FNN. We will just get the 5 closest observations for each of the 10 clusters.
library(FNN)
y <- get.knnx(x, kmeans.x$centers, 5)
str(y)
# List of 2
# $ nn.index: int [1:10, 1:5] 42 40 50 22 39 47 11 7 8 16 ...
# $ nn.dist : num [1:10, 1:5] 0.1237 0.0669 0.1316 0.1194 0.1253 ...
y$nn.index[1, ]
# [1] 42 38 3 22 43
idx1 <- sort(y$nn.index[1, ])
cbind(idx1, x[idx1, ])
# idx1
# [1,] 3 0.28614 0.3984854 0.21657
# [2,] 22 0.13871 0.1404791 0.41064
# [3,] 38 0.20766 0.0899805 0.11372
# [4,] 42 0.43577 0.0002389 0.08026
# [5,] 43 0.03743 0.2085700 0.46407
The row indices of the nearest neighbors are stored in nn.index, so for the first cluster, the 5 closest observations are 42, 38, 3, 22, 43.
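To scale this up to the original goal (e.g. 100 samples per cluster), the same call works with a larger k. A minimal sketch, reusing x and kmeans.x from above (k=5 here, since the toy data only has 50 rows):
# k nearest rows of x for every centroid
k <- 5
nn <- get.knnx(x, kmeans.x$centers, k)
# one matrix per cluster: the k observations closest to that centroid
closest <- lapply(seq_len(nrow(kmeans.x$centers)), function(i) x[nn$nn.index[i, ], , drop = FALSE])
Since nn.index contains row numbers of the original data, the same indices can be used to look up the treatment labels in the unscaled training table.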
I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos.
My thought was to treat strings whose distance is below a certain threshold (e.g. Levenshtein distance below 2) as equal. Right now I can calculate the string distances, but I have not managed to make the changes to my input string that would get me the correct number of unique names.
library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
[,1] [,2] [,3] [,4]
[1,] 0 6 1 5
[2,] 6 0 7 4
[3,] 1 7 0 6
[4,] 5 4 6 0
(number_of_people<-str_count(names, ",")+1)
[1] 4
The correct value of number_of_people should be, of course, 3.
As I am only interested in the number of uniques names, I am not concerned if "Michael" becomes replaced by "Miichael" or the other way round.
One option is to try to cluster the names based on their distance matrix:
library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)
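Since cutree() returns one group index per name, the number of unique names is simply the number of distinct groups:
# number of distinct people after merging near-duplicates
length(unique(cutree(hc, h=3)))
# 3 for this example: {Michael, Miichael}, {Liz}, {Maria}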
Depending on your actual data, you will need to experiment with the distance type (qgrams/cosine may be useful, or the Jaro-Winkler distance in the case of names).
I have two clustering results for the same variables but with different values each time. Let us create them with the following code:
set.seed(11)
a<-matrix(rnorm(10000),ncol=100)
colnames(a)<-(c(1:100))
set.seed(31)
b<-matrix(rnorm(10000),ncol=100)
colnames(b)<-colnames(a)
c.a<-hclust(dist(t(a)))
c.b<-hclust(dist(t(b)))
# clusters
groups.a<-cutree(c.a, k=15)
# take groups names
clus.a=list()
for (i in 1:15) clus.a[[i]] <- colnames(a)[groups.a==i]
# see the clusters
clus.a
groups.b<-cutree(c.b, k=15)
clus.b=list()
for (i in 1:15) clus.b[[i]] <- colnames(b)[groups.b==i]
# see the clusters
clus.b
What I get from that is two lists, clus.a and clus.b with the names (here just numbers from 1 to 100) of each cluster's variables.
Is there any way to examine if and which of the variables are clustered together in both clusterings? Meaning, how can I see whether I have variables (could be groups of 2, 3, 4, etc.) in the same clusters for both clus.a and clus.b (they don't have to be in the same cluster number)?
If I understand your question correctly, you want to know if there are any clusters in a which have exactly the same membership as any of the clusters in b. Here's one way to do that.
Note: AFAICT in your example there are no matching clusters in a and b, so we create a few artificially to demo the solution.
# create artificial matches
clus.b[[3]] <- clus.a[[2]]
clus.b[[10]] <- clus.a[[8]]
clus.b[[15]] <- clus.a[[11]]
f <- function(a,b) (length(a)==length(b) & length(intersect(a,b))==length(a))
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
which(result, arr.ind=TRUE)
# row col
# [1,] 2 3
# [2,] 8 10
# [3,] 11 15
So this loops through all the clusters in b (sapply(clus.b,...)) and, for each, loops through all the clusters in a looking for an exact match (in arbitrary order). For there to be a match, both clusters must have the same length, and the intersection of the two must contain all the elements of either, hence have the same length. This process produces a logical matrix with rows representing a and columns representing b.
Edit: updated to reflect the OP's revised question.
To detect clusters with two or more common elements, use:
f <- function(a,b) length(intersect(a,b))>1
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
matched <- which(result, arr.ind=TRUE)
matched
# row col
# [1,] 4 1
# [2,] 8 1
# [3,] 11 1
# [4,] 3 2
# ...
To identify which elements were present in both:
apply(matched,1,function(r) intersect(clus.a[[r[1]]],clus.b[[r[2]]]))
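As an aside (not part of the original answer), a quick overview of how the two clusterings overlap, before hunting for exact or partial matches, is a cross-tabulation of the two membership vectors:
# rows: clusters of a; columns: clusters of b; cells count shared variables
table(groups.a, groups.b)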
I have two sets of matrices. Each matrix is 100x100 in dimension and I have 240 of them (imagine each matrix was collected in a month and I have a dataset composed of 240 months of 100x100 matrices).
The values in the matrices range from 1 to 15, representing vegetation types (grass, tropical forest, tundra etc).
My first set of matrices, m1, is my control experiment. My second set of matrices, m2, is a climate change experiment where changes in climate induce changes in the values of the matrices.
Therefore, the data is represented like this:
m1: set of 240 100x100 matrices, each matrix corresponding to a month (therefore 240 months of data). This is my control data
m2: same as m1, but the values are different because of some changes in climate. This is my experimental data.
Here is some data:
# generate dataset 1
set.seed(4)
someData1 <- round(runif(100 * 100 * 240, min=1, max=15),digits=0)
# generate dataset2
set.seed(5)
someData2 <- round(runif(100 * 100 * 240, min=1, max=15),digits=0)
# create matrices
k = 240; n=100; m = 100
m1 <- array(someData1, c(n,m,k))
m2 <- array(someData2, c(n,m,k))
What I would like to do is compare each cell of m2 relative to m1 in this way:
is the value different? yes/no
if yes, what was the change? for example 1 to 10, or 2 to 7 and so on.
and do the same for all 240 matrices in m2 relative to all 240 matrices in m1.
By the end, I would like to be able to:
have a binary matrix showing whether or not there has been changes in the values;
have a table with the frequency of changes in each class (i.e. 1 to 10, 2 to 7 etc).
Conceptually, what I need to achieve would be something like this (image not shown; for simplicity's sake I drew 5x5 matrices instead of 100x100 matrices).
How to achieve this in R?
To compare two matrices, use == or !=.
what.changed <- m1 != m2 # T if changed F if not
changes <- ifelse(what.changed, paste(m1, 'to', m2), NA)
changes # for your little matrices not the 100x100
[,1] [,2] [,3]
[1,] NA "7 to 10" "6 to 7"
[2,] NA NA NA
[3,] "3 to 4" "6 to 8" NA
Your matrices seem rather large, so I'm not sure if some sort of sparse matrix approach might be better. Regarding storing the changes as strings ("3 to 4"), perhaps you could store only the cells where there is in fact a change, rather than creating such a large matrix where most of the elements are NA, e.g. as in the sketch below.
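A minimal sketch of that idea, keeping only the cells that actually changed (a suggestion; the original answer left this part implicit):
# linear indices of the cells that changed
changed <- which(what.changed)
# one "old to new" string per change, instead of a mostly-NA matrix
change.strings <- paste(m1[changed], 'to', m2[changed])
head(change.strings)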
Or perhaps you could create a CSV/data frame summarising your changes, e.g. (using your 100x100x240 matrices to demonstrate the 3 coordinates):
# find coordinates of changes
change.coords <- which(m1 != m2, arr.ind=T)
colnames(change.coords) <- c('x', 'y', 'time') # whatever makes sense to your application
changes <- data.frame(change.coords, old=m1[change.coords], new=m2[change.coords])
head(changes)
x y time old new
1 1 1 1 9 4
2 2 1 1 1 11
3 3 1 1 5 14
4 5 1 1 12 2
5 6 1 1 5 11
6 7 1 1 11 8
Then you can print it out as you wish, without having to store heaps of strings ("X to Y") and NAs, e.g. (don't do this with your big example matrices; there are waaay too many changes and it will print them all):
with(changes, message(sprintf("Coords (%i, %i, %i): %i to %i\n",
x, y, time, old, new)))
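The question also asked for a frequency table of the change classes; with the changes data frame above, that is one more line (again a suggestion beyond the original answer):
# how often each kind of change ("old to new") occurs
change.freq <- table(paste(changes$old, 'to', changes$new))
head(sort(change.freq, decreasing=TRUE))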
I'm using the SVD package with R and I'm able to reduce the dimensionality of my matrix by replacing the lowest singular values with 0. But when I recompose my matrix I still have the same number of features; I could not find how to effectively delete the most useless features of the source matrix in order to reduce its number of columns.
For example what I'm doing for the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s <- svd(A)
s$d[3:4] <- 0  # replace the 2 smallest singular values by 0
A.prime <- s$u %*% diag(s$d) %*% t(s$v)  # A' (named A.prime, since ' is not valid in an R name)
I get A' (A.prime), which has the same dimensions (4x4), was reconstructed with only 2 "components", and is an approximation of A (containing a little less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a sub-matrix with fewer columns that reproduces the distances between the different rows, something like this (obtained using PCA; let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p <- prcomp(A)
A.doubleprime <- p$x[,1:2]  # A''
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)
I would check out this chapter on dimensionality reduction or this cross-validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA in the sense that you might only choose to keep 2 out of 10 principal components.
When you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data. The data still has the same dimension.
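That said, the reduced-column matrix the asker describes can be read straight off the SVD: the left singular vectors scaled by the singular values give the row coordinates in the low-rank space, and on column-centered data these match the PCA scores up to sign. A minimal sketch, assuming A is the 4x4 matrix from the question:
s <- svd(A)
# coordinates of the 4 rows in the rank-2 space: a 4x2 matrix
A.reduced <- s$u[, 1:2] %*% diag(s$d[1:2])
# row-to-row distances in A.reduced approximate those in A
dist(A.reduced)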