Is there a way to change the index on kmeans()$cluster? - r

I am using kmeans() to create groups based on a score. The goal is to assign star ratings, so that the individuals with the highest scores get four stars, and the individuals with the lowest scores get 1 star. I would like to create the star variable based on the kmeans()$cluster value. However, as it stands, kmeans()$cluster indexes the clusters, but the index does not correspond to the relative position of the group.
Is there a way to manually assign the cluster indexes, or to set the index to be assigned in a certain order? I'm hoping to have kmeans()$cluster=1 for the low score group, kmeans()$cluster=2 for second lowest, etc.
id <- 1:500
set.seed(12); score <- runif(500, 0, 1)
dat <- data.frame(id, score)
km = kmeans(dat$score, 4, nstart=10)
plot(dat$score,
col = c(km$cluster),
main="K-Means result with 4 clusters",
pch=20,
cex=0.8)
dat$star <- km$cluster
plot(dat$score,
dat$star,
main="Score v. cluster number")

Any of these will yield a new cluster assignment vector in which 1 refers to the cluster with the smallest center, 2 to the next smallest, and so on. The first is expressed solely in terms of fitted(km), the second in terms of km$centers and km$cluster, and the last in terms of fitted(km) and km$centers. The first returns a factor; the other two return plain numeric vectors.
fit <- fitted(km)
factor(fit, labels = 1:nlevels(factor(fit)))
rank(km$centers)[km$cluster]
match(fitted(km), sort(km$centers))
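As a minimal sketch on the question's simulated scores, the rank-based variant can be checked by confirming that the mean score rises with the new label. The names `star` and `star_means` are illustrative and not part of the original answer.

```r
# Simulated scores from the question
set.seed(12)
score <- runif(500, 0, 1)
km <- kmeans(score, 4, nstart = 10)

# rank() gives each original cluster its position among the sorted centers;
# indexing by km$cluster maps every observation to that position
star <- as.integer(rank(km$centers)[km$cluster])

# Mean score per star level; these should increase from star 1 to star 4
star_means <- tapply(score, star, mean)
print(star_means)
```

Because the relabeling depends only on the order of the centers, it should give the same star levels regardless of how kmeans() happened to number the clusters on a given run.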

Yes. You can just use a small table of what you want the values to be and use the original cluster number to look them up. Here is an example.
set.seed(2017)
KM3 = kmeans(iris[,1:4], 3)
KM3$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[77] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3
[115] 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
The clusters are in an awkward order. I want the low numbered points to be in cluster 1, the middle in cluster 2 and the high numbered points in cluster 3. So I want to change all of the 1's to 3, the 2's to 1 and the 3's to 2.
Relabel = c(3,1,2)
KM3$cluster = Relabel[KM3$cluster]
KM3$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2
[115] 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Just a little extra detail. It says Relabel = c(3,1,2) because I want 1 to become 3, so the first location has a 3. I want 2 to become 1, so the second location has a 1. And I want 3 to become 2, so the third location has a 2.
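The lookup vector can also be derived rather than typed by hand. The sketch below assumes you want the clusters ordered by their center on one chosen variable (Petal.Length here, an arbitrary choice for illustration). The names `km3`, `relabel` and `new_cluster` are hypothetical, and `nstart = 10` is added only to make the fit more stable.

```r
set.seed(2017)
km3 <- kmeans(iris[, 1:4], 3, nstart = 10)

# Position i of relabel holds the new label for original cluster i
relabel <- unname(rank(km3$centers[, "Petal.Length"]))
new_cluster <- relabel[km3$cluster]

# Cluster means of Petal.Length now increase with the label
print(tapply(iris$Petal.Length, new_cluster, mean))
```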

Related

Hierarchical Clustering produces list instead of hclust

I have been doing some hierarchical clustering in R. It has worked fine up until now, producing hclust objects left, right and center, but suddenly it no longer does. Now it only produces lists when I run:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when trying to:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless print of the values (I'm guessing from the list).
The cutree function returns the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional comparison can then be done by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill=Species)) +
geom_bar()
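On the "list instead of hclust" point: an hclust object is stored as a list with a class attribute, so is.list() returning TRUE does not by itself mean something broke. A short check, using iris as a stand-in for the asker's data (which is not shown), might look like this:

```r
iris.clust <- hclust(dist(iris[, 1:4]))
print(class(iris.clust))    # the class attribute of the fitted object
print(is.list(iris.clust))  # the object is a list underneath

iris.clustcut <- cutree(iris.clust, 4)

# cutree() returns one integer per observation, so table() can cross it
# with any other vector of the same length; here the known Species labels
tab <- table(cluster = iris.clustcut, species = iris$Species)
print(tab)
```

If the second argument to table() has many distinct values (a continuous variable, for instance), the resulting table will be very wide, which may be what looked like an endless print.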

K-means algorithm, R

Hi everyone! I've been asked to implement a K-means algorithm in R, but I don't really know the language, so I found some example code on the internet and decided to use it. I've looked into it, learned the functions it uses, and corrected it a bit, because it didn't work very well. Here's the code:
# Creating a sample of data
y=rnorm(500,1.65)
x=rnorm(500,1.15)
x=cbind(x,y)
centers <- x[sample(nrow(x),5),]
# A function for calculating the distance between centers and the rest of the dots
euclid <- function(points1, points2) {
distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
}
distanceMatrix
}
# A method function
K_means <- function(x, centers, euclid, nItter) {
clusterHistory <- vector(nItter, mode="list")
centerHistory <- vector(nItter, mode="list")
for(i in 1:nItter) {
distsToCenters <- euclid(x, centers)
clusters <- apply(distsToCenters, 1, which.min)
centers <- apply(x, 2, tapply, clusters, mean)
# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}
structure(list(clusters = clusterHistory, centers = centerHistory))
}
res <- K_means(x, centers, euclid, 5)
# To use the same plot operations I had to call unlist(), since the object my function returns is a list of lists,
# whereas the default object is just a list. I also store the history of each iteration in that object.
res <- unlist(res, recursive = FALSE)
plot(x, col = res$clusters5)
points(res$centers5, col = 1:5, pch = 8, cex = 2)
It works fine on this simple matrix. But I've been asked to use it on iris:
head(iris)
a <-data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
centers <- a[sample(nrow(a),3),]
iris_clusters <- K_means(a, centers, euclid, 3)
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
head(iris_clusters)
And the problem is that it doesn't work. The error is:
Error in distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, :
number of items to replace is not a multiple of replacement length
I understand that the dimensions of the objects don't match, but I don't understand why. That's why I'm asking for help. I apologize in advance for any mistakes in this code; I'm not really familiar with the language yet, so please don't judge me too harshly. Thank you!
Your implementation should work if you convert the data frames to matrices with as.matrix():
iris_clusters <- K_means(as.matrix(a), as.matrix(centers), euclid, 3) # 3 iterations
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
# plotting the clusters obtained on the first two dimensions at the end of 3rd iteration
plot(a[,1:2], col = iris_clusters$clusters3, pch=19)
points(iris_clusters$centers3, col = 1:5, pch = 8, cex = 2)
head(iris_clusters)
# cluster assignments and centroids computed at different iterations
$clusters1
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 3 2 2 3 3
[77] 2 2 3 3 3 3 3 2 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 1 2 1 1 3 1 1 1 2 2 2 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2
$clusters2
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 3 3 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2
[77] 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2
$clusters3
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2
$centers1
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.150000 3.120000 6.090000 2.1350000
2 6.315909 2.915909 5.059091 1.8000000
3 5.297674 3.115116 2.550000 0.6744186
$centers2
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.122727 3.113636 6.031818 2.1318182
2 6.123529 2.852941 4.741176 1.6132353
3 5.056667 3.268333 1.810000 0.3883333
$centers3
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.014815 3.096296 5.918519 2.155556
2 6.025714 2.805714 4.588571 1.518571
3 5.005660 3.369811 1.560377 0.290566
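An alternative to casting at every call site is to coerce inside the distance function, so that it accepts either data frames or matrices. This is only a sketch of that idea, not the original author's code; the fixed starting rows 1, 51 and 101 are an arbitrary choice so the result is reproducible.

```r
# Distance from every row of points1 to every row of points2
euclid <- function(points1, points2) {
  # Coerce up front: on a matrix, points2[i, ] is a plain numeric vector,
  # which recycles correctly against the columns of t(points1)
  points1 <- as.matrix(points1)
  points2 <- as.matrix(points2)
  distanceMatrix <- matrix(NA_real_, nrow = nrow(points1), ncol = nrow(points2))
  for (i in seq_len(nrow(points2))) {
    distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, ])^2))
  }
  distanceMatrix
}

a <- iris[, 1:4]                 # data frame, as in the question
centers <- a[c(1, 51, 101), ]    # three rows used as starting centers
d <- euclid(a, centers)
print(dim(d))                    # one row per observation, one column per center
```

Note that the `centers` object computed inside K_means() by `apply(x, 2, tapply, clusters, mean)` is already a matrix, so only the inputs on the first call are affected.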

how to read numbers with irregular spaces in R

I have my data in a txt file containing the following numbers. How can I read it into R?
I tried fread but it did not work:
Error in fread("x.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first):
Here is the data:
2 3 3 2 1 2 3 2 3 2 1 3 1 2
1 1 3 2 3 1 2 1 2 3 3 2
3 1 1 1 2 1 1 3 1 2 2 2
1 3 1 1 3 2 3 3 1 1 2 2
1 3 2 3 2 1 3 1 1 1 3 1
1 3 1 2 3 3 2 2 2 2 3 3
1 3 2 3 2 3 2 2 2 1 3 1
3 2 1 2 2 3 3 2 3 2 3 3
2 1
Try this.
x <- scan("x.txt")
data <- as.data.frame(x)
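This works because scan() splits on any run of whitespace, including line breaks, so rows of different lengths are not a problem. A small self-contained check, using the `text` argument in place of a file (the string below is made up for illustration):

```r
# Irregular spacing and uneven line lengths, supplied inline instead of a file
txt <- "2 3 3  2 1\n1   1 3\n2 1"

x <- scan(text = txt, quiet = TRUE)  # numeric vector, read in order
print(x)

data <- as.data.frame(x)             # one column named x
print(nrow(data))
```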

How to get the count or percentage of one factor variable in each decile ?

My dataset is like the following example
Tier Decile
1 1
1 1
2 1
3 1
2 1
2 2
1 2
3 2
3 2
3 2
1 3
2 3
2 3
3 3
3 3
I want output like the following, either as a simple count or as a percentage.
Can lapply or the aggregate function do this?
Tier Decile1 Decile2 Decile3
Tier= 1 2 1 1
Tier = 2 2 1 2
Tier = 3 1 2 2
Use table. Assume df is your data.frame
> with(df, table(Tier, Decile))
Decile
Tier 1 2 3
1 2 1 1
2 2 1 2
3 1 3 2
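For the percentage version the question also asks about, prop.table() can be applied to the same table. The sketch below rebuilds the example data from the question; `counts` and `pct` are illustrative names, and `margin = 2` assumes the percentages should sum to 100 within each decile.

```r
# Example data from the question
df <- data.frame(
  Tier   = c(1, 1, 2, 3, 2,  2, 1, 3, 3, 3,  1, 2, 2, 3, 3),
  Decile = rep(1:3, each = 5)
)

counts <- with(df, table(Tier, Decile))

# Share of each Tier within each Decile (columns sum to 100)
pct <- round(100 * prop.table(counts, margin = 2), 1)
print(pct)
```

Use `margin = 1` instead if each Tier's row should sum to 100.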

Episode count for each row

I'm sure this has been asked before but for the life of me I can't figure out what to search for!
I have the following data:
x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4
And I would like to output a running count that resets every time either x or y changes value.
x y o
1 3 1
1 3 2
1 3 3
1 2 1
1 2 2
2 2 1
2 4 1
3 4 1
3 4 2
Try something like
df<-read.table(header=T,text="x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4")
cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
> cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
After seeing @ttmaccer's answer, I see that my first attempt with ave was wrong, and this is perhaps what is needed:
> dat$o <- ave(dat$y, list(dat$y, dat$x), FUN=seq_along)
# seq_along avoids the warning that FUN=seq gives for single-row groups; the result is the same.
> dat
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
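The two answers agree on this data but are not equivalent in general: the rle() version counts within consecutive runs, while the ave() version counts within each (x, y) combination wherever it occurs. A small made-up example where a combination reappears later shows the difference; `run_count` and `group_count` are illustrative names.

```r
# The (1, 3) combination appears in rows 1-2 and again in row 4
df <- data.frame(x = c(1, 1, 2, 1), y = c(3, 3, 3, 3))

# Resets whenever x or y changes from the previous row
run_count <- sequence(rle(paste(df$x, df$y))$lengths)

# Counts occurrences of each (x, y) pair across the whole data frame
group_count <- ave(seq_len(nrow(df)), df$x, df$y, FUN = seq_along)

print(cbind(df, run_count, group_count))
```

Since the question asks for a count that resets whenever a value changes, the run-based version is the closer match when combinations can repeat.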
