Improve processing performance in R for Social Network Analysis

I am doing social network analysis using the igraph package in R and I am dealing with close to 2 million vertices and edges. I also need to calculate degrees of separation for nearly 8 million pairs of vertices. Execution usually takes somewhere between 2 and 3 hours, which is way too high. I need some input and suggestions to improve this performance. Below is the sample code I am using:
g <- graph.data.frame( ids, directed = F) # ids contains approximately 2 million records
distances(graph = g, v = t_ids$ID_from[x], to = t_ids$ID_to[x], weights = NA)
# t_ids contains approximately 8 million records for which degrees of separation is to be calculated using Shortest Path Algorithms
Thanks in advance!

I don't think the distances() call itself can be made dramatically faster, but I'd be very happy to be proven wrong.
You should look into other ways of optimising the code that is running.
If your data is fixed, you could compute the distances once, save the (probably rather big) distance matrix, and look degrees of separation up in it.
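A minimal sketch of that idea (for 2 million vertices a full all-pairs matrix will not fit in memory, so in practice you would restrict v and to to the vertices you actually need; the file name and the vertex names "123"/"456" below are just placeholders):
D <- distances(g, weights = NA)      # compute every pairwise distance once
saveRDS(D, "distance_matrix.rds")    # save it for reuse in later sessions
D <- readRDS("distance_matrix.rds")
D["123", "456"]                      # degrees of separation looked up by vertex name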
If your analysis does not require distances between all pairs of vertices, you should look into optimising your code by shortening t_ids and requesting only the distances you need. I suspect that you're already doing this, though.
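For instance (a sketch, not your exact pipeline, and assuming the IDs in t_ids match the vertex names in g), you could make one distances() call over the unique endpoints that appear in t_ids and then do a single vectorised lookup by vertex name, instead of calling distances() once per pair:
from_u <- unique(as.character(t_ids$ID_from))
to_u   <- unique(as.character(t_ids$ID_to))
D <- distances(g, v = from_u, to = to_u, weights = NA)  # still large if from_u/to_u are long
sep <- D[cbind(as.character(t_ids$ID_from),             # one value per requested pair,
               as.character(t_ids$ID_to))]              # matched by vertex name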
distances() actually computes rather quickly. At 10,000 nodes (which amounts to roughly 5.0*10^7 undirected distances), my crappy machine gets a full 700 MB distance matrix in a few seconds.
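If you want to sanity-check that on your own machine, a quick illustrative timing along these lines should do (the graph size and edge count are arbitrary):
library(igraph)
g10k <- erdos.renyi.game(10000, 15000, type = "gnm")  # arbitrary sparse test graph
system.time(D <- distances(g10k, weights = NA))       # all-pairs, unweighted
format(object.size(D), units = "MB")                  # ~763 MB for a 10,000 x 10,000 double matrix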
I first thought about the different algorithms you can choose in distances(), but now I doubt that they will help you. I ran a speed test on the different algorithms to see if I could recommend any of them to you, but they all seem to run at more or less the same speed (results are times relative to the "automatic" algorithm, which is what your code above would use):
  sample automatic unweighted  dijkstra bellman-ford   johnson
1     10         1  0.9416667 0.9750000    1.0750000 1.0833333
2    100         1  0.9427083 0.9062500    0.8906250 0.8958333
3   1000         1  0.9965636 0.9656357    0.9977090 0.9873998
4   5000         1  0.9674200 0.9947269    0.9691149 1.0007533
5  10000         1  1.0070885 0.9938136    0.9974223 0.9953602
I don't think anything can be concluded from this, but it's running on an Erdős-Rényi model. It's possible that your network structure favours one algorithm over another, but they would still not give you anywhere near the performance boost that you're hoping for.
The code is here:
# igraph
library(igraph)
# setup:
samplesizes <- c(10, 100, 1000, 5000, 10000)
reps <- c(100, 100, 15, 3, 1)
algorithms = c("automatic", "unweighted", "dijkstra", "bellman-ford", "johnson")
df <- as.data.frame(matrix(ncol=length(algorithms), nrow=0), stringsAsFactors = FALSE)
names(df) <- algorithms
# any random graph
g <- erdos.renyi.game(10000, 10000, "gnm")
# These are the different algorithms used by distances:
m.auto <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="automatic")
m.unwg <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="unweighted")
m.dijk <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="dijkstra")
m.belm <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="bellman-ford")
m.john <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="johnson")
# They produce the same result:
sum(m.auto == m.unwg & m.auto == m.dijk & m.auto == m.belm & m.auto == m.john) == length(m.auto)
# This function will be used to test the speed of distances() run with different algorithms
test_distances <- function(alg){
  m.auto <- distances(g, v=V(g), to=V(g), weights=NA, algorithm=alg)
  (TRUE)
}
# Build testresults
for(i.sample in 1:length(samplesizes)){
  # Create a random network to test
  g <- erdos.renyi.game(samplesizes[i.sample], (samplesizes[i.sample]*1.5), type = "gnm", directed = FALSE, loops = FALSE)
  i.rep <- reps[i.sample]
  for(i.alg in 1:length(algorithms)){
    df[i.sample,i.alg] <- system.time( replicate(i.rep, test_distances(algorithms[i.alg]) ) )[['elapsed']]
  }
}
# Normalize benchmark results
dfn <- df
dfn[, 1:ncol(df)] <- df[, 1:ncol(df)] / df[, 1]
dfn$sample <- samplesizes
dfn <- dfn[,c(6,1:5)]
dfn

Related

Pairwise Dijkstra's with early termination according to hop-count in R

I am looking for the most computational and memory friendly approach to computing particular entries of the distance matrix D obtained by pairwise Dijkstra's algorithm in R. More precisely, I only need D[i,j] if the hop-count (unweighted) distance between node i and node j is at most a particular integer k (D[i,j] itself may computed as a weighted shortest path length for which the number of hops may be greater than k). D should be encoded as a sparse matrix for memory efficiency.
I was wondering if there has been some work done on this or if there is an efficient approach towards optimizing the current igraph functions to account for this restriction. E.g., early exit in pairwise Dijkstra's algorithm could really improve the efficiency of solving my problem.
I have tried to make this as efficient as possible myself, but with no luck so far. Some first attempt is illustrated below.
library(igraph)
library(Matrix)
library(spam)
# Hoped to be the more efficient one
bounded_hop_pairG_1 <- function(G, k=2){
  to <- ego(G, order=k)
  D <- sparseMatrix(i=unlist(lapply(1:length(V(G)), function(v) rep(v, length(to[[v]])))),
                    j=unlist(to),
                    x=unlist(lapply(1:length(V(G)), function(v) distances(G, v=v, to=to[[v]]))))
  return(D)
}
# Hoped to be the less efficient one
bounded_hop_pairG_2 <- function(G, k=2){
  D <- distances(G)
  D[distances(G, weights=NA) > k] <- 0
  return(as.spam(D))
}
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
# Check whether 'distances' actually implements early termination
start_time <- Sys.time()
d1 <- distances(G, v=1)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.00497961 secs
start_time <- Sys.time()
d2 <- distances(G, v=1, to=521)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.002238274 secs (consistently smaller than above)
start_time <- Sys.time()
D1 <- bounded_hop_pairG_1(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 2.671333 secs
start_time <- Sys.time()
D2 <- bounded_hop_pairG_2(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 1.101419 secs
Though I suspect my first function applies early termination and never stores the full pairwise distance matrix, it appears to be much less efficient than my second function (which also performs a full unweighted pairwise distance computation) in terms of computation time. Hence, I was hoping somebody could point out the most efficient way to implement the first function in R.
You could try the cppRouting package, available via GitHub.
It provides functions like get_distance_matrix() which can use all cores.
library(cppRouting)
library(igraph)
library(spam)
library(Matrix)
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
#Graph to data frame
G2<-as_long_data_frame(G)
#Weighted graph
graph1<-makegraph(G2[,1:3],directed = F)
#Unweighted graph
graph2<-makegraph(cbind(G2[,1:2],rep(1,nrow(G2))),directed = F)
nodes <- sort(unique(c(G2$from, G2$to)))
myfunc <- function(Gr1, Gr2, nd, k=2, cores=FALSE){
  test  <- get_distance_matrix(Gr1, nd, nd, allcores = cores)  # weighted distances
  test2 <- get_distance_matrix(Gr2, nd, nd, allcores = cores)  # hop counts (unit weights)
  test[test2 > k] <- 0
  return(as.spam(test))
}
#Your first function
system.time(
D1 <- bounded_hop_pairG_1(G)
)
#2.18s
#Your second function
system.time(
D2 <- bounded_hop_pairG_2(G)
)
#1.01s
#One core
system.time(
D3 <- myfunc(graph1,graph2,nodes))
#0.69s
#Parallel
system.time(
D4 <- myfunc(graph1,graph2,nodes,cores=TRUE))
#0.32s
If you really want to stop the algorithm once the hop limit k is reached and you have a little knowledge of C++, it seems rather simple to slightly modify the original Dijkstra algorithm and then use it via Rcpp.

How to permute a network in igraph for R?

I'm trying to write a code for a Monte Carlo procedure in R. My goal is to estimate the significance of a metric calculated for a weighted, unipartite, undirected network formatted for the package igraph.
So far, I included the following steps in the code:
1. Create the weighted, unipartite, undirected network and calculate the observed Louvain modularity
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs=modularity1$modularity
obs
real<-data.frame(obs)
real
2. Create the empty vector
Nperm = 9 #I am starting with a low n, but intend to use at least 1000 permutations
randomized.modularity=matrix(nrow=length(obs),ncol=Nperm+1)
row.names(randomized.modularity)=names(obs)
randomized.modularity[,1]=obs
randomized.modularity
3. Permute the original network preserving its characteristics, calculate the Louvain modularity for all randomized networks, and compile the results in the vector
i <- 1
while(i <= Nperm){
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  mod <- (cluster_louvain(randomnet))
  mod$modularity
  linha = mod$modularity
  randomized.modularity[,i+1] = linha
  print(i)
  i = i + 1
}
randomized.modularity #Here the result is not as expected
4. Plot the observed value against the distribution of randomized values
niveis <- row.names(randomized.modularity)
for(k in niveis)
{
  if(any(is.na(randomized.modularity[k,]) == TRUE))
  {
    print(c(k, "metric has NA"))
  } else {
    nome.arq <- paste("modularity", k, ".png", sep="")
    png(filename=nome.arq, res=300, height=15, width=21, units="cm")
    plot(density(randomized.modularity[k,]), main="Observed vs. randomized")
    abline(v=obs[k], col="red", lwd=2, xlab="")
    dev.off()
    print(k)
    nome.arq <- paste("Patefield_Null_mean_sd_", k, ".txt", sep="")
    write.table(cbind(mean(randomized.modularity[k,]), sd(randomized.modularity[k,])),
                file=nome.arq, sep=" ", row.names=TRUE, col.names=FALSE)
  }
}
5. Estimate the P-value (significance)
significance=matrix(nrow=nrow(randomized.modularity),ncol=3)
row.names(significance)=row.names(randomized.modularity)
colnames(significance)=c("p (rand <= obs)", "p (rand >= obs)", "p (rand=obs)")
signif.sup=function(x) sum(x>=x[1])/length(x)
signif.inf=function(x) sum(x<=x[1])/length(x)
signif.two=function(x) ifelse(min(x)*2>1,1,min(x)*2)
significance[,1]=apply(randomized.modularity,1,signif.inf)
significance[,2]=apply(randomized.modularity,1,signif.sup)
significance[,3]=apply(significance[,-3],1,signif.two)
significance
Something is going wrong in step 3. I expected the vector to be filled with 10 values, but for some reason it stops after a while.
The slot "mod$modularity" suddenly receives 2 values instead of 1.
The two TXT files mentioned in the beginning of the code can be downloaded from here:
https://1drv.ms/t/s!AmcVKrxj94WClv8yQyqyl4IWk5mNvQ
https://1drv.ms/t/s!AmcVKrxj94WClv8z_Pow5Tg2U7mjLw
Could you please help me?
Your error is due to a mismatch in dimensions between your randomized.modularity matrix and some of your randomized modularity results. In your example the matrix ends up being 1 x (Nperm+1), but cluster_louvain() sometimes returns 2 modularity scores (one per level of the algorithm) during the permutations. To fix this I simply store the results in a list. The rest of your analysis will need to be adjusted, since you now have a varying number of modularity scores per permutation.
library(igraph)
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs <- modularity1$modularity
obs
real<-data.frame(obs)
real
Nperm = 100 #I am starting with a low n, but intend to use at least 1000 permutations
#randomized.modularity <- matrix(nrow=length(obs),ncol=Nperm+1)
#row.names(randomized.modularity) <- names(obs)
randomized.modularity <- list()
randomized.modularity[1] <- obs
randomized.modularity
for(i in 1:Nperm){
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  mod <- (cluster_louvain(randomnet))
  mod$modularity
  linha = mod$modularity
  randomized.modularity <- c(randomized.modularity, list(linha))
}
randomized.modularity
Better way to write the loop
randomized.modularity <- lapply(seq_len(Nperm), function(x){
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  return(cluster_louvain(randomnet)$modularity)
})

Weighted Kmeans R

I want to do a Kmeans clustering on a dataset (namely, Sample_Data) with three variables (columns) such as below:
    A  B  C
1  12 10  1
2   8 11  2
3  14 10  1
.   .  .  .
.   .  .  .
.   .  .  .
In a typical workflow, after scaling the columns and determining the number of clusters, I would use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if some variables matter more than others? Suppose, for example, that variable (column) A is more important than the other two variables.
How can I insert their weights into the model?
Thank you all
You have to use a weighted k-means clustering, like the one provided in the flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
       weights=NULL, control=NULL, group=NULL, simple=FALSE,
       save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights: An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when using competitive learning the family is always kmeans.
The difference lies in how cluster assignment is done during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a sequence of numbers; in general I use numbers between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem and the answer here is not satisfying for me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying for me, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package description.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: First, rewrite the flexclust-algorithm to enable weights within the standard approach. Or second, you can estimate weighted cluster centroids as starting centroids and perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids and perform a k-means with one iteration and so on until you reach convergence.
I used the second alternative b/c it was the easier way for me. I used the data.table package, hope you are familiar with it.
rm(list=ls())
library(data.table)

### gen dataset with sample-weights
dataset <- data.table(iris)
dataset[, weights := rep(c(1, 0.7, 0.3, 4, 5), 30)]
dataset[, Species := NULL]

### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
                     method = 'ward.D2')
no_of_clusters <- 4

### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                             ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters))
{
  weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cutree(clustering, k = no_of_clusters) == i,],
                                   weighted.mean,
                                   w = dataset[cutree(clustering, k = no_of_clusters) == i, weights])
}

### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1

## while loop: if the number of iterations is smaller than 50 and cluster_i (result of
## the current iteration) is not identical to cluster_iminus1 (result of the former
## iteration) then continue
while(identical(cluster_i, cluster_iminus1) == F && iter < 50){
  # update iteration
  iter <- iter + 1
  # k-means with weighted centroids and one iteration (may generate warning messages
  # as no convergence is reached)
  cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids, iter.max = 1)$cluster
  # estimating new weighted centroids from the current k-means assignment
  weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                               ncol = ncol(dataset[, c(1:4)]))
  for (i in (1:no_of_clusters))
  {
    weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cluster_kmeans == i,],
                                     weighted.mean,
                                     w = dataset[cluster_kmeans == i, weights])
  }
  # update cluster_i and cluster_iminus1
  if(iter == 1) {cluster_iminus1 <- 0} else {cluster_iminus1 <- cluster_i}
  cluster_i <- cluster_kmeans
}

## merge final clusters to data table
dataset[, cluster := cluster_i]
If you want to increase the weight of a variable (column), just multiply it with a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
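For example, a minimal sketch reusing Sample_Data from the question (the choice of column "A" and the factor c = 2 are arbitrary):
Sample_Data_scaled <- scale(Sample_Data)                   # scale first, as in the question
Sample_Data_scaled[, "A"] <- 2 * Sample_Data_scaled[, "A"] # c = 2 doubles A's influence on the SSQ
output_kmeans <- kmeans(Sample_Data_scaled, centers = 5, nstart = 50)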

generating random x and y coordinates with a minimum distance

Is there a way in R to generate random coordinates with a minimum distance between them?
E.g. what I'd like to avoid
x <- c(0,3.9,4.1,8)
y <- c(1,4.1,3.9,7)
plot(x~y)
This is a classical problem from stochastic geometry. Completely random points in space where the number of points falling in disjoint regions are independent of each other corresponds to a homogeneous Poisson point process (in this case in R^2, but could be in almost any space).
An important feature is that the total number of points has to be random before you can have independence of the counts of points in disjoint regions.
For the Poisson process points can be arbitrarily close together. If you define a process by sampling the Poisson process until you don't have any points that are too close together you have the so-called Gibbs Hardcore process. This has been studied a lot in the literature and there are different ways to simulate it. The R package spatstat has functions to do this. rHardcore is a perfect sampler, but if you want a high intensity of points and a big hard core distance it may not terminate in finite time... The distribution can be obtained as the limit of a Markov chain and rmh.default lets you run a Markov chain with a given Gibbs model as its invariant distribution. This finishes in finite time but only gives a realisation of an approximate distribution.
In rmh.default you can also simulate conditional on a fixed number of points. Note that when you sample in a finite box there is of course an upper limit to how many points you can fit with a given hard core radius, and the closer you are to this limit the more problematic it becomes to sample correctly from the distribution.
Example:
library(spatstat)
beta <- 100; R = 0.1
win <- square(1) # Unit square for simulation
X1 <- rHardcore(beta, R, W = win) # Exact sampling -- beware it may run forever for some par.!
plot(X1, main = paste("Exact sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X1) # Observed min. nearest neighbour dist.
#> [1] 0.102402
Approximate simulation
model <- rmhmodel(cif="hardcore", par = list(beta=beta, hc=R), w = win)
X2 <- rmh(model)
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X2, main = paste("Approx. sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X2) # Observed min. nearest neighbour dist.
#> [1] 0.1005433
Approximate simulation conditional on number of points
X3 <- rmh(model, control = rmhcontrol(p=1), start = list(n.start = 42))
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X3, main = paste("Approx. sim. given n =", 42))
minnndist(X3) # Observed min. nearest neighbour dist.
#> [1] 0.1018068
OK, how about this? You just generate random number pairs without restriction and then remove the ones which are too close. This could be a good start:
minimumDistancePairs <- function(x, y, minDistance){
  i <- 1
  repeat{
    distance <- sqrt((x-x[i])^2 + (y-y[i])^2) < minDistance # pythagorean theorem
    distance[i] <- FALSE # distance to oneself is always zero
    if(any(distance)) {  # if too close to any other point
      x <- x[-i]         # remove element from x
      y <- y[-i]         # and remove element from y
    } else {             # otherwise...
      i = i + 1          # repeat the procedure with the next element
    }
    if (i > length(x)) break
  }
  data.frame(x,y)
}
minimumDistancePairs(
c(0,3.9,4.1,8)
, c(1,4.1,3.9,7)
, 1
)
will lead to
x y
1 0.0 1.0
2 4.1 3.9
3 8.0 7.0
Be aware, though, that the retained points are no longer uniformly random, however you solve this problem.
You can use rejection sampling: https://en.wikipedia.org/wiki/Rejection_sampling
The principle is simple: you resample until your data satisfy the condition.
> set.seed(1)
>
> x <- rnorm(2)
> y <- rnorm(2)
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 6.565578
> while((x[1]-x[2])^2+(y[1]-y[2])^2 > 1) {
+ x <- rnorm(2)
+ y <- rnorm(2)
+ }
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 0.9733252
>
The following is a naive hit-and-miss approach which for some choices of parameters (which were left unspecified in the question) works well. If performance becomes an issue, you could experiment with the package gpuR which has a GPU-accelerated distance matrix calculation.
rand.separated <- function(n,x0,x1,y0,y1,d,trials = 1000){
  for(i in 1:trials){
    nums <- cbind(runif(n,x0,x1), runif(n,y0,y1))
    if(min(dist(nums)) >= d) return(nums)
  }
  return(NA) # no luck
}
This repeatedly draws samples of size n in [x0,x1]x[y0,y1] and throws a sample away if it doesn't satisfy the minimum-distance condition. As a safety, trials guards against an infinite loop. If solutions are hard to find or n is large, you might need to increase trials.
For example:
> set.seed(2018)
> nums <- rand.separated(25,0,10,0,10,0.2)
> plot(nums)
runs almost instantly and produces a scatter plot of 25 well-separated points (figure not shown).
I'm not sure what you are asking. If you just want random coordinates, here:
c(
  runif(1, max=y[1], min=x[1]),
  runif(1, max=y[2], min=x[2]),
  runif(1, min=y[3], max=x[3]),
  runif(1, min=y[4], max=x[4])
)

R: Cluster analysis with hclust(). How to get the cluster representatives?

I am doing some cluster analysis with R. I am using the hclust() function and I would like to get, after I perform the cluster analysis, the cluster representative of each cluster.
I define a cluster representative as the instance which is closest to the centroid of its cluster.
So the steps are:
Finding the centroid of the clusters
Finding the cluster representatives
I have already asked a similar question but using K-means: https://stats.stackexchange.com/questions/251987/cluster-analysis-with-k-means-how-to-get-the-cluster-representatives
The problem, in this case, is that hclust doesn't give the centroids!
For example, with d being my distance matrix, what I have done so far is:
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
## getting centroids ##
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, data, groups1)
But now, I was trying to get the cluster representatives with this code (I got it in the other question I asked, for k-means):
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}
And it says that:
"Error in e2[[j]] : index out of the limit"
I would be grateful if any of you could give me a little help. Thanks.
-- (not) Working example of the code --
example_data.txt
A,B,C
10.761719,5.452188,7.575762
10.830457,5.158822,7.661588
10.75391,5.500170,7.740330
10.686719,5.286823,7.748297
10.864527,4.883244,7.628730
10.701415,5.345650,7.576218
10.820583,5.151544,7.707404
10.877528,4.786888,7.858234
10.712337,4.744053,7.796390
As for the code:
# Install R packages
#install.packages("fpc")
#install.packages("cluster")
#install.packages("rgl")
library(fpc)
library(cluster)
library(rgl)
CV <- read.csv("example_data.txt")
str(CV)
data <- scale(CV)
d <- dist(data,method = "euclidean")
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, CV, groups1)
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}
Hierarchical clustering does not use (or compute) representatives.
In particular for single link (but it can also happen with other linkages), the "center" can lie in a different cluster; think of elongated or ring-shaped clusters like the top two data sets in the example image (not shown here).
Furthermore, the centroid (mean) is tied to Euclidean distance. With other distance measures, it may be a very bad representative.
So use with care!
Either way, hierarchical clustering does not define or compute a representative. You will have to do this yourself.
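One way to do it yourself, as a sketch reusing data and groups1 from the question's code and assuming Euclidean distance on the scaled data:
centroids <- sapply(sort(unique(groups1)),
                    function(i) colMeans(data[groups1 == i, , drop = FALSE]))
representatives <- sapply(sort(unique(groups1)), function(i) {
  members <- which(groups1 == i)
  d2 <- colSums((t(data[members, , drop = FALSE]) - centroids[, i])^2)  # squared distance to centroid i
  members[which.min(d2)]                                                # row index of the closest instance
})
representatives  # one representative row index per cluster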
