I have been doing some hierarchical clustering in R. It has worked fine up until now, producing hclust objects left, right and center, but suddenly it no longer does. Now it only produces lists when I run:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when I try:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless printout of the values (I'm guessing from the list).
The cutree function returns the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Further comparisons can then be made by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill=Species)) +
  geom_bar()
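For the original question, a quick way to check what each step returns is sketched below on iris, which stands in for mydata (not shown in the question, so this is an assumption). An hclust object is itself stored as a list, so is.list() is TRUE for it; what identifies it is its class. cutree() returns a plain vector of cluster labels that table() can cross-tabulate.

```r
# Sketch on iris, standing in for the asker's mydata (an assumption)
iris.clust <- hclust(dist(iris[, 1:4]))
iris.clustcut <- cutree(iris.clust, 4)

# An hclust object is a list underneath; its class is what identifies it
is.list(iris.clust)              # TRUE
inherits(iris.clust, "hclust")   # TRUE

# cutree() returns one numeric cluster label per observation
is.numeric(iris.clustcut)
length(iris.clustcut) == nrow(iris)

# Cross-tabulate cluster membership against a column of the data
tab <- table(cluster = iris.clustcut, species = iris$Species)
tab
```

If table() prints a very long stream of values instead, checking class() of both arguments is a reasonable first step. One possible cause (an assumption, since mydata is not shown) is a second argument with many distinct values, which makes the table very wide.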
Hello everyone! I've been asked to implement a K-means algorithm in R, but I don't really know the language, so I found some example code on the internet and decided to use it. I've looked into it, learned the functions it uses, and corrected it a bit, because it didn't work very well. Here's the code:
# Creating a sample of data
y=rnorm(500,1.65)
x=rnorm(500,1.15)
x=cbind(x,y)
centers <- x[sample(nrow(x),5),]
# A function for calculating the distances between the centers and the rest of the points
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow = dim(points1)[1], ncol = dim(points2)[1])
  for (i in 1:nrow(points2)) {
    distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, ])^2))
  }
  distanceMatrix
}
# The K-means function
K_means <- function(x, centers, euclid, nItter) {
  clusterHistory <- vector(nItter, mode = "list")
  centerHistory <- vector(nItter, mode = "list")
  for (i in 1:nItter) {
    # Assign each point to its nearest center, then recompute the centers
    distsToCenters <- euclid(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters = clusterHistory, centers = centerHistory)
}
res <- K_means(x, centers, euclid, 5)
# To use the same plot operations I had to use unlist, since the object returned by my function is a list of lists,
# while the default object is just a list. I also store the history of each iteration in that object.
res <- unlist(res, recursive = FALSE)
plot(x, col = res$clusters5)
points(res$centers5, col = 1:5, pch = 8, cex = 2)
It works fine on this simple matrix. But I've been asked to use it on iris:
head(iris)
a <- data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
centers <- a[sample(nrow(a),3),]
iris_clusters <- K_means(a, centers, euclid, 3)
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
head(iris_clusters)
And the problem is that it doesn't work. The error is:
Error in distanceMatrix[, i] <- sqrt(rowSums(t(t(points1) - points2[i, :
number of items to replace is not a multiple of replacement length
I understand that the dimensions of the objects don't match, but I don't understand why. That's why I'm asking for help. I apologize in advance for any stupidity in this code; I'm not really familiar with the language yet, so please don't judge me too harshly. Thank you!
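Your implementation should work with simple typecasts. euclid relies on matrix arithmetic, but a and centers are data frames, so points2[i, ] is a one-row data frame rather than a numeric vector, and the subtraction does not recycle it across the observations as intended. Converting both arguments to matrices fixes that: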
iris_clusters <- K_means(as.matrix(a), as.matrix(centers), euclid, 3) # 3 iterations
iris_clusters <- unlist(iris_clusters, recursive = FALSE)
# plotting the clusters obtained on the first two dimensions at the end of 3rd iteration
plot(a[,1:2], col = iris_clusters$clusters3, pch=19)
points(iris_clusters$centers3, col = 1:3, pch = 8, cex = 2)
head(iris_clusters)
# cluster assignments and centroids computed at different iterations
$clusters1
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 3 2 2 3 3
[77] 2 2 3 3 3 3 3 2 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 1 2 1 1 3 1 1 1 2 2 2 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2
$clusters2
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 3 3 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2
[77] 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2
$clusters3
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2
$centers1
  iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1          7.150000         3.120000          6.090000        2.1350000
2          6.315909         2.915909          5.059091        1.8000000
3          5.297674         3.115116          2.550000        0.6744186
$centers2
  iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1          7.122727         3.113636          6.031818        2.1318182
2          6.123529         2.852941          4.741176        1.6132353
3          5.056667         3.268333          1.810000        0.3883333
$centers3
  iris.Sepal.Length iris.Sepal.Width iris.Petal.Length iris.Petal.Width
1          7.014815         3.096296          5.918519         2.155556
2          6.025714         2.805714          4.588571         1.518571
3          5.005660         3.369811          1.560377         0.290566
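The cast matters because of how a single row is extracted. The following is a small sketch of the difference, using iris[, 1:4] in place of the asker's a (same columns, different names): a row taken from a data frame is still a data frame, while a row taken from a matrix is a numeric vector that can be subtracted from every observation.

```r
a <- iris[, 1:4]

# One row of a data frame is still a data frame ...
row_df <- a[1, ]
is.data.frame(row_df)    # TRUE

# ... while one row of a matrix is a plain numeric vector
m <- as.matrix(a)
row_mat <- m[1, ]
is.numeric(row_mat)      # TRUE
length(row_mat)          # 4, one value per column

# With matrices, t(t(m) - v) subtracts v from every row of m,
# so the distance vector has one entry per observation
d <- unname(sqrt(rowSums(t(t(m) - row_mat)^2)))
length(d)                # 150
d[1]                     # 0: distance of the first row to itself
```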
I have my data in a txt file containing the following numbers. How can I read it into R?
I tried fread, but it did not work:
Error in fread("x.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first):
Here is the data:
2 3 3 2 1 2 3 2 3 2 1 3 1 2
1 1 3 2 3 1 2 1 2 3 3 2
3 1 1 1 2 1 1 3 1 2 2 2
1 3 1 1 3 2 3 3 1 1 2 2
1 3 2 3 2 1 3 1 1 1 3 1
1 3 1 2 3 3 2 2 2 2 3 3
1 3 2 3 2 3 2 2 2 1 3 1
3 2 1 2 2 3 3 2 3 2 3 3
2 1
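Try this. scan reads whitespace-separated numbers into a single vector, so the uneven line lengths don't matter.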
x <- scan("x.txt")
data <- as.data.frame(x)
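To try this without the original file, here is a minimal sketch that writes three of the ragged lines from the question to a temporary file (standing in for x.txt) and reads them back with scan().

```r
# Write a few ragged lines to a temporary file (stand-in for x.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("2 3 3 2 1 2 3 2 3 2 1 3 1 2",
             "1 1 3 2 3 1 2 1 2 3 3 2",
             "2 1"), tmp)

# scan() ignores line boundaries and returns one numeric vector
x <- scan(tmp, quiet = TRUE)
length(x)   # 28 values: 14 + 12 + 2

# Wrap the vector in a data frame if a column is needed
data <- data.frame(x = x)
unlink(tmp)
```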
My dataset looks like the following example:
Tier Decile
1 1
1 1
2 1
3 1
2 1
2 2
1 2
3 2
3 2
3 2
1 3
2 3
2 3
3 3
3 3
I want to get a result like the following, either as a simple count or maybe as a percentage.
Can lapply or the aggregate function do this?
         Decile1 Decile2 Decile3
Tier = 1       2       1       1
Tier = 2       2       1       2
Tier = 3       1       3       2
Use table. Assume df is your data.frame:
> with(df, table(Tier, Decile))
    Decile
Tier 1 2 3
   1 2 1 1
   2 2 1 2
   3 1 3 2
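Since the question also mentions percentages, here is a short sketch using prop.table() on the same table. The data frame is rebuilt from the example data in the question, and the choice of row-wise percentages (each Tier summing to 100) is an assumption about what is wanted.

```r
# Example data from the question
df <- data.frame(
  Tier   = c(1, 1, 2, 3, 2,  2, 1, 3, 3, 3,  1, 2, 2, 3, 3),
  Decile = rep(1:3, each = 5)
)

# Counts of each Tier within each Decile
tab <- with(df, table(Tier, Decile))

# Percentage of each Tier falling in each Decile (each row sums to 100)
pct <- round(100 * prop.table(tab, margin = 1), 1)
pct
```

Using margin = 2 instead gives percentages within each Decile, and omitting margin gives percentages of the grand total.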
I'm sure this has been asked before but for the life of me I can't figure out what to search for!
I have the following data:
x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4
And I would like to output a running count that resets every time either x or y changes value.
x y o
1 3 1
1 3 2
1 3 3
1 2 1
1 2 2
2 2 1
2 4 1
3 4 1
3 4 2
Try something like
df <- read.table(header = TRUE, text = "x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4")
cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
> cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
After seeing @ttmaccer's answer I see my first attempt with ave was wrong, and this is perhaps what is needed:
> dat$o <- ave(dat$y, list(dat$y, dat$x), FUN = seq_along)
# FUN = seq also gives the correct answer here, but with a warning for single-row groups.
> dat
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
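The two approaches agree on the example data, but they are not equivalent in general. The sketch below uses a small made-up data frame (hypothetical, not from the question) in which the combination x = 1, y = 3 comes back after an interruption: the rle version restarts the count at every change, while the ave version keeps counting within each combination.

```r
# Hypothetical data: the pair (1, 3) reappears in the last row
df <- data.frame(x = c(1, 1, 2, 1), y = c(3, 3, 2, 3))

# Run-based count: resets whenever x or y changes from the previous row
run_count <- sequence(rle(paste(df$x, df$y))$lengths)

# Group-based count: numbers the rows within each (x, y) combination
grp_count <- ave(seq_len(nrow(df)), df$x, df$y, FUN = seq_along)

run_count   # 1 2 1 1
grp_count   # 1 2 1 3
```

Since the question asks for a count that resets every time a value changes, the run-based version is the closer match when combinations can repeat.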