K-means algorithm in R

Hi everyone! I've been asked to write a K-means algorithm in R, but I don't really know the language, so I found some example code on the internet and decided to use it. I looked into it, learned the functions it uses, and corrected it a bit because it didn't work very well. Here's the code:
# Creating a sample of data
y=rnorm(500,1.65)
x=rnorm(500,1.15)
x=cbind(x,y)
centers <- x[sample(nrow(x),5),]
# A function for calculating the distance between centers and the rest of the dots
euclid <- function(points1, points2) {
distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
}
distanceMatrix
}
# A method function
K_means <- function(x, centers, euclid, nItter) {
clusterHistory <- vector(nItter, mode="list")
centerHistory <- vector(nItter, mode="list")
for(i in 1:nItter) {
distsToCenters <- euclid(x, centers)
clusters <- apply(distsToCenters, 1, which.min)
centers <- apply(x, 2, tapply, clusters, mean)
# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}
structure(list(clusters = clusterHistory, centers = centerHistory))
}
res <- K_means(x, centers, euclid, 5)
# To reuse the same plot operations I had to call unlist(), since the object my function returns is a list of lists,
# whereas the default object is just a list. I also store the history of each iteration in that object.
res <- unlist(res, recursive = FALSE)
plot(x, col = res$clusters5)
points(res$centers5, col = 1:5, pch = 8, cex = 2)
It works fine on this simple matrix. But I've been asked to use it on iris:
head(iris)
a <-data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
centers <- a[sample(nrow(a),3),]
iris_clusters <- K_means(a, centers, euclid, 3)
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
head(iris_clusters)
And the problem is that it doesn't work. The error is:
Error in distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, :
number of items to replace is not a multiple of replacement length
I understand that the dimensions of the objects don't match, but I don't understand why. That's why I'm asking for help. I apologize in advance for any stupidity in this code; I'm not really familiar with the language yet, so please don't judge me too harshly. Thank you!

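Your implementation should work with simple typecasts: a and centers are data frames, while euclid was written for matrices, so convert both with as.matrix():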
iris_clusters <- K_means(as.matrix(a), as.matrix(centers), euclid, 3) # 3 iterations
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
# plotting the clusters obtained on the first two dimensions at the end of 3rd iteration
plot(a[,1:2], col = iris_clusters$clusters3, pch=19)
points(iris_clusters$centers3, col = 1:3, pch = 8, cex = 2)
head(iris_clusters)
# cluster assignments and centroids computed at different iterations
$clusters1
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 3 2 2 3 3
[77] 2 2 3 3 3 3 3 2 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 1 2 1 1 3 1 1 1 2 2 2 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2
$clusters2
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 3 3 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2
[77] 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2
$clusters3
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2
$centers1
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.150000 3.120000 6.090000 2.1350000
2 6.315909 2.915909 5.059091 1.8000000
3 5.297674 3.115116 2.550000 0.6744186
$centers2
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.122727 3.113636 6.031818 2.1318182
2 6.123529 2.852941 4.741176 1.6132353
3 5.056667 3.268333 1.810000 0.3883333
$centers3
iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1 7.014815 3.096296 5.918519 2.155556
2 6.025714 2.805714 4.588571 1.518571
3 5.005660 3.369811 1.560377 0.290566
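As a complement to casting at the call site, the coercion could also live inside the distance function. The following is only a minimal sketch under that assumption: euclid_safe is a hypothetical name that does not appear in the original code, and sweep() is swapped in for the double transpose.

```r
# Hypothetical variant of euclid() that accepts data frames or matrices
euclid_safe <- function(points1, points2) {
  points1 <- as.matrix(points1)   # coerce once, up front
  points2 <- as.matrix(points2)
  distanceMatrix <- matrix(NA_real_, nrow = nrow(points1), ncol = nrow(points2))
  for (i in seq_len(nrow(points2))) {
    # sweep() subtracts center i from every row of points1 (column-wise stats)
    diffs <- sweep(points1, 2, points2[i, ])
    distanceMatrix[, i] <- sqrt(rowSums(diffs^2))
  }
  distanceMatrix
}

a <- iris[, 1:4]                  # a data frame, as in the question
set.seed(1)
idx <- sample(nrow(a), 3)
centers <- a[idx, ]               # also a data frame
d <- euclid_safe(a, centers)
dim(d)                            # one row per observation, one column per center
```

With this version, K_means(a, centers, euclid_safe, 3) could be called on the data frames directly, since the centers recomputed inside the loop are already a matrix.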

Related

Form a vector A of length 100 whose odd and even position value will be 2 and 3, respectively

I would like to form a vector A of length 100 whose odd and even position value will be 2 and 3, respectively, without using loop or replacing procedure.
A <- rep(0, 100)
A[seq(2, 100, 2)] = 3
A[seq(1, 100, 2)] = 2
You can use the rep function you used initially:
A <- rep(2:3, 50)
You can try:
ifelse(seq_along(A) %% 2 == 1, 2, 3)
[1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
[59] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
Create a 2x50 matrix with the first row being 2s and the second row being 3s. Then cast it as a vector.
c(rbind(rep(2, 50), rep(3, 50)))
[1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
[91] 2 3 2 3 2 3 2 3 2 3
To simplify the ifelse solution:
3 - (seq_along(A) %% 2)
or using recycling:
A[ c(TRUE, FALSE) ] <- 2
A[ c(FALSE, TRUE) ] <- 3
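The approaches above can be checked against each other with a short sketch. The names A1 to A5 are only illustrative labels for this comparison.

```r
n <- 100
A1 <- rep(2:3, n / 2)                          # repeat the pair 2, 3
A2 <- ifelse(seq_len(n) %% 2 == 1, 2, 3)       # odd positions get 2, even get 3
A3 <- c(rbind(rep(2, n / 2), rep(3, n / 2)))   # interleave two rows column by column
A4 <- 3 - (seq_len(n) %% 2)                    # arithmetic on the parity
A5 <- numeric(n)
A5[c(TRUE, FALSE)] <- 2                        # logical index is recycled over A5
A5[c(FALSE, TRUE)] <- 3

# All five vectors hold the same values (A1 is integer, the others are double)
all(A1 == A2, A2 == A3, A3 == A4, A4 == A5)
```

rep(2:3, 50) is probably the simplest of these; rep(c(2, 3), length.out = n) is a variant that also works when n is odd.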

Is there a way to change the index on kmeans()$cluster?

I am using kmeans() to create groups based on a score. The goal is to assign star ratings, so that the individuals with the highest scores get four stars, and the individuals with the lowest scores get 1 star. I would like to create the star variable based on the kmeans()$cluster value. However, as it stands, kmeans()$cluster indexes the clusters, but the index does not correspond to the relative position of the group.
Is there a way to manually assign the cluster indexes, or to set the index to be assigned in a certain order? I'm hoping to have kmeans()$cluster=1 for the low score group, kmeans()$cluster=2 for second lowest, etc.
id <- 1:500
set.seed(12); score <- runif(500, 0, 1)
dat <- data.frame(id, score)
km = kmeans(dat$score, 4, nstart=10)
plot(dat$score,
col = c(km$cluster),
main="K-Means result with 4 clusters",
pch=20,
cex=0.8)
dat$star <- km$cluster
plot(dat$score,
dat$star,
main="Score v. cluster number")
Any of these will yield a new cluster assignment vector in which 1 refers to the cluster with the smallest center, 2 to the next smallest, and so on. The first is expressed solely in terms of fitted(km), the second in terms of km$centers and km$cluster, and the last in terms of fitted(km) and km$centers.
fit <- fitted(km)
factor(fit, labels = 1:nlevels(factor(fit)))
rank(km$centers)[km$cluster]
match(fitted(km), sort(km$centers))
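As a rough check of the relabeling idea, the sketch below reuses the score data from the question and confirms that the mean score rises with the new label. The variable names star and star_alt are assumptions made for illustration.

```r
set.seed(12)
score <- runif(500, 0, 1)
km <- kmeans(score, 4, nstart = 10)

# rank() of the centers maps each original cluster number to its position
# when the centers are sorted in increasing order
star <- rank(km$centers)[km$cluster]

# Same relabeling through the fitted values (each point's own cluster center)
star_alt <- match(fitted(km), sort(km$centers))

# Mean score per star; the means should increase from star 1 to star 4
tapply(score, star, mean)
all(star == star_alt)
```

Because the data are one-dimensional, ordering the centers is unambiguous. For multivariate data a single column of km$centers, or some summary of each center, would have to be chosen as the ordering key.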
Yes. You can just use a small table of what you want the values to be and use the original cluster number to look them up. Here is an example.
set.seed(2017)
KM3 = kmeans(iris[,1:4], 3)
KM3$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[77] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3
[115] 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
The clusters are in an awkward order. I want the low numbered points to be in cluster 1, the middle in cluster 2 and the high numbered points in cluster 3. So I want to change all of the 1's to 3, the 2's to 1 and the 3's to 2.
Relabel = c(3,1,2)
KM3$cluster = Relabel[KM3$cluster]
KM3$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2
[115] 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Just a little extra detail: it says Relabel = c(3,1,2) because I want 1 to become 3, so the first position holds a 3; I want 2 to become 1, so the second position holds a 1; and I want 3 to become 2, so the third position holds a 2.
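If the goal is simply to number the clusters in the order in which they first appear in the data, the lookup table does not have to be typed by hand. This is a small sketch of that variation, not the answerer's code; relabeled is a hypothetical name.

```r
set.seed(2017)
KM3 <- kmeans(iris[, 1:4], 3)

# unique() lists the cluster numbers in order of first appearance;
# match() then gives each observation the position of its cluster in that list
relabeled <- match(KM3$cluster, unique(KM3$cluster))

relabeled[1]                   # the first observation is always in cluster 1
table(KM3$cluster, relabeled)  # each old label maps to exactly one new label
```

The manual Relabel vector is still the more general tool, since it allows any mapping rather than only first-appearance order.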

Hierarchical Clustering produces list instead of hclust

I have been doing some hierarchical clustering in R. It has worked fine up until now, producing hclust objects left, right and center, but suddenly not anymore. Now it only produces lists when I run:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when trying to:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless print of the values (I'm guessing from the list).
The cutree function provides the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional comparison can then be done by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill=Species)) +
geom_bar()
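A quick way to inspect what these objects actually are is sketched below. It relies only on base R and the iris data; the remark about many distinct values is a guess at what might be happening with customer_lifetime, which cannot be verified from the question.

```r
iris.clust <- hclust(dist(iris[, 1:4]))
class(iris.clust)      # "hclust"
is.list(iris.clust)    # TRUE: an hclust object is a list with a class attribute

iris.clustcut <- cutree(iris.clust, 4)
is.atomic(iris.clustcut)   # a plain vector, one group number per observation

# table() prints one column per distinct value of its second argument, so a
# variable with few distinct values gives a compact table
tab <- table(iris.clustcut, iris$Species)
tab

# For a numeric variable with many distinct values, binning it first with
# cut() keeps the table small
table(iris.clustcut, cut(iris$Sepal.Length, 3))
```

So seeing a list in the environment pane does not by itself mean that hclust failed; class() is the more reliable check.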

R predict.glm when newdata has fewer levels

I attempted to prove to myself that predict() will not give incorrect predictions when the labels and levels (the underlying integer codes of the factor) of newdata do not match those of the training data.
I think I did prove that, and I'm sharing the code below, but I'd like to ask what exactly R does when predicting for newdata. I know it is not appending newdata to the training data; does it translate the factor labels of newdata into the corresponding representation of the training data before predicting?
options(stringsAsFactors = TRUE)
dat <- data.frame(x = rep(c("cat", "dog", "bird", "horse"), 100), y = rgamma(100, shape=3, scale = 300))
model <- glm(y~., family = Gamma(link = "log"), data = dat)
coefficients(model)
# (Intercept) xcat xdog xhorse
# 6.5816536 0.2924488 0.3586094 0.2740487
newdata1 <- data.frame(x = "cat")
newdata2 <- data.frame(x = "bird")
newdata3 <- data.frame(x = "dog")
predict.glm(object = model, newdata = newdata1, type = "response")
# 1
# 966.907
exp(6.5816536 + 0.2924488) #intercept + cat coef
# [1] 966.9071
predict.glm(object = model, newdata = newdata2, type = "response")
# 1
# 721.7318
exp(6.5816536)
# [1] 721.7318
predict.glm(object = model, newdata = newdata3, type = "response")
# 1
# 1033.042
exp(6.5816536 + 0.3586094)
# [1] 1033.042
unclass(dat$x)
# [1] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [87] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [173] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [259] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [345] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# attr(,"levels")
# [1] "bird" "cat" "dog" "horse"
unclass(newdata1$x)
# [1] 1
# attr(,"levels")
# [1] "cat"
unclass(newdata2$x)
# [1] 1
# attr(,"levels")
# [1] "bird"
The model object has an xlevels component that records the factor levels used for model estimation. For your example, we have:
model$xlevels
#$x
#[1] "bird" "cat" "dog" "horse"
When new data is supplied for prediction, its factor levels are matched by label against xlevels. For example, the "cat" in your newdata1 is matched to the "cat" level, which is the second level in xlevels, so predict has no difficulty finding the correct coefficient for that level.
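The matching by label can be imitated by hand. The sketch below illustrates the idea and is not the exact internal code path of predict; recoded and manual are hypothetical names, and set.seed(1) is added only to make the run repeatable.

```r
set.seed(1)
dat <- data.frame(x = factor(rep(c("cat", "dog", "bird", "horse"), 100)),
                  y = rgamma(400, shape = 3, scale = 300))
model <- glm(y ~ x, family = Gamma(link = "log"), data = dat)

newdata1 <- data.frame(x = factor("cat"))   # its only level, "cat", has code 1

# Re-encode the new factor against the training levels, by label
recoded <- factor(as.character(newdata1$x), levels = model$xlevels$x)
as.integer(recoded)   # 2: "cat" is the second training level

# The prediction agrees with intercept + "cat" coefficient on the response scale
p <- predict(model, newdata = newdata1, type = "response")
manual <- exp(coef(model)["(Intercept)"] + coef(model)["xcat"])
all.equal(unname(p), unname(manual))
```

The integer code of "cat" in newdata1 is 1 while its code in the training data is 2, yet the prediction uses the "cat" coefficient, which is consistent with matching on labels rather than on codes.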

Episode count for each row

I'm sure this has been asked before but for the life of me I can't figure out what to search for!
I have the following data:
x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4
And I would like to output a running count that resets every time either x or y changes value.
x y o
1 3 1
1 3 2
1 3 3
1 2 1
1 2 2
2 2 1
2 4 1
3 4 1
3 4 2
Try something like
df<-read.table(header=T,text="x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4")
cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
> cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
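The one-liner above can be unpacked into its intermediate steps. This is the same computation spread over several lines; key and runs are names introduced here for illustration.

```r
df <- data.frame(x = c(1, 1, 1, 1, 1, 2, 2, 3, 3),
                 y = c(3, 3, 3, 2, 2, 2, 4, 4, 4))

key  <- paste(df$x, df$y)   # one string per row, e.g. "1 3"
runs <- rle(key)            # lengths of consecutive runs of identical keys
runs$lengths                # 3 2 1 1 2

# sequence() expands each run length n into 1, 2, ..., n
df$o <- sequence(runs$lengths)
df$o                        # 1 2 3 1 2 1 1 1 2
```

Because rle() only looks at consecutive rows, the count restarts whenever the (x, y) pair changes, even if the same pair shows up again later in the data.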
After seeing @ttmaccer's answer I see my first attempt with ave was wrong, and this is perhaps what is needed:
> dat$o <- ave(dat$y, list(dat$y, dat$x), FUN=seq_along )
# FUN=seq gives the same answer here but warns, because seq() on a group of length one expands the value itself
> dat
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
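The two answers agree on the sample data, but they are not equivalent in general: rle() counts within consecutive runs, whereas ave() counts within each (x, y) combination wherever it occurs. A small made-up example, using seq_along as the counting function, shows the difference when a combination reappears.

```r
# Hypothetical data in which the pair (1, 3) comes back after an interruption
d <- data.frame(x = c(1, 1, 2, 1), y = c(3, 3, 3, 3))

by_run   <- sequence(rle(paste(d$x, d$y))$lengths)
by_group <- ave(seq_along(d$x), d$x, d$y, FUN = seq_along)

by_run     # 1 2 1 1  (count resets on every change)
by_group   # 1 2 1 3  (count continues for the returning pair)
```

For a count that resets every time x or y changes, as the question asks, the run-based version appears to be the closer match.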

Resources