K-means iterated 10 times on the same data - r

I am new to R. I am trying to see whether I can improve a k-means result by calling the k-means routine repeatedly (10/15 times) on the same dataset with the same value of K (k = 3 in my case) and keeping the best run. I see that the clustering changes at every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best solution.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
  for (x in 1:xtimes)
  {
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results = run_kmeans(10)

I'm not sure I understood your question, because this is not the usual way of selecting the best partition (the usual tools are the elbow method, the silhouette method, etc.).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write something like this to run kmeans repeatedly:
xtimes <- 10
km_list <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
An lapply is preferable to a for loop here because it returns a list directly (naming the result km_list also avoids masking the kmeans function itself). To extract tot.withinss and see which run is minimal:
perf <- sapply(km_list, function(d) d$tot.withinss)
which.min(perf)
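Note that base R's kmeans can do this restarting for you: the nstart argument fits the model from several random starts and returns the run with the lowest total within-cluster sum of squares. An equivalent of the loop above:
best_fit <- kmeans(x, centers = 3, nstart = 10)  # keeps the best of 10 starts
best_fit$tot.withinss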
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced from the same sample data with the same number of clusters.
Edit from your comment
OK, so you want to find the combination of columns that gives you the best performance. Below is an example where every two-by-two combination of three variables is tested. You could generalize this a bit, but the number of possible combinations with 8 variables is very large, so you would want a routine to reduce the number of tested combinations.
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3))
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
km_list <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(km_list, function(d) d$tot.withinss)
which.min(perf)
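To see which pair of columns won, index the combinations list (and the fitted models) at the same position:
best <- which.min(perf)
combinations[[best]]     # the winning pair of column names
km_list[[best]]$cluster  # cluster assignments for that pair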


Distribution of mean*standard deviation of a sample from a Gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a population of 1000 individuals with a variable x following the Gaussian distribution; draw 50 random samples of 5 individuals from this population with replacement; calculate the mean and standard deviation of x for each sample; create an interaction variable y by multiplying each sample's mean and standard deviation of x; and plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for (i in 1:N) {
  # draw 5 samples from the population (without replacement)
  # I assume you want to replace between each turn of taking 5.
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code.
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns; in each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized alternatives and are usually avoided where a vectorized form exists. In the case of your question it makes no noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
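If you only need y and not the intermediate columns, the whole simulation collapses into a single replicate call; a compact sketch of the same logic:
# same simulation, computing y directly for each of the N draws
y <- replicate(N, {
  smpl <- sample(stat_pop, n, replace = FALSE)
  mean(smpl) * sd(smpl)
})
hist(y)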
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Clustering ranking

I'm analyzing data in R where predictor variables are available but there is no response variable. Using unsupervised learning (k-means) I have identified patterns in the data, but I need to rank the clusters according to their overall performance (for example: students' data on exam marks and co-curricular marks). What technique do I use after clustering in R?
The cluster component of the kmeans output gives you the index of the cluster each data point belongs to. Example data taken from the kmeans documentation:
nclusters = 5
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)
Now, your evaluation function (e.g. mean of column values) can be applied to each cluster individually:
for (i in 1:nclusters) {
  cat(i, apply(x[which(cl$cluster == i), ], MARGIN = 2, FUN = mean), "\n")
}
Or better still, use an aggregation function such as tapply or aggregate:
aggregate(x, by=list(cluster=cl$cluster), FUN=mean)
which gives
cluster x y
1 1 1.2468266 1.1499059
2 2 -0.2787117 0.0958023
3 3 0.5360855 1.0217910
4 4 1.0997776 0.7175210
5 5 0.2472313 -0.1193551
At this point you should be able to rank the values of the aggregation function as needed.
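For instance, assuming (hypothetically) that overall performance is just the average of the per-cluster column means, the ranking could look like:
agg <- aggregate(x, by = list(cluster = cl$cluster), FUN = mean)
# hypothetical overall score: mean of the aggregated x and y columns
agg$score <- rowMeans(agg[, c("x", "y")])
agg[order(agg$score, decreasing = TRUE), ]  # clusters ranked best to worst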

R: Applying function to every element of matrix using elements of different matrix as function input

I wish to apply a custom function to each element of a matrix whilst also using elements of a different matrix as inputs to the function.
Specifically, my function generates random samples from a von Mises distribution (circular normal distribution), calling the Rfast package's rvonmises function.
I have one matrix (radians) which records the angle I wish to use for the central tendency of the random generation (similar to the mean), and another matrix (kappa) which records the concentration parameter of the von Mises I wish to use (similar to standard deviation).
I wish to use (for example) element [1, 1] of the radians matrix together with element [1, 1] of the kappa matrix in a call to the von Mises random generator. So, my call for one element would be:
rvonmises(n = 1, m = radians[1, 1], k = kappa[1, 1])
But of course I want this applied across all elements of the matrices. (The rvonmises function doesn't accept multiple m or k values, so for example I couldn't use rvonmises(4, m = c(1, 2, 3, 4), k = c(1, 1.2, 1.4, 1.6)).)
To summarise: I am basically after a more principled (and faster!) way of doing this:
# preallocate the result matrix so the loop below runs as-is
result <- matrix(NA_real_, nrow = nrow(radians), ncol = ncol(radians))
for (i in 1:nrow(radians)) {
  for (j in 1:ncol(radians)) {
    result[i, j] <- Rfast::rvonmises(1, radians[i, j], kappa[i, j])
  }
}
What I have tried
Based on this post, I have tried to use mapply:
library(Rfast)
set.seed(42)
# random radians to use as input
radians <- matrix(data = runif(12, 0, 2 * pi), ncol = 4)
# random concentration parameters of the von Mises distribution
kappa <- matrix(data = rgamma(12, 70, 30), ncol = 4)
# function to generate one random von Mises draw with angle m and
# concentration parameter k
my_function <- function(m, k) {
  Rfast::rvonmises(1, m, k)
}
# my attempt
out <- matrix(mapply(my_function, m = as.data.frame(radians), k = kappa),
              ncol = 4, byrow = TRUE)
However, I don't think this is working. For example, I tested it with the following (where the central tendency in test_radians increases steadily, and I use large values for kappa, which leads to precise draws):
test_radians <- matrix(data = seq(from = 1, to = 2 * pi, length.out = 12),
                       ncol = 4)
test_kappa <- matrix(data = rep(20, times = 12), ncol = 4)
test <- matrix(mapply(my_function, m = as.data.frame(test_radians),
                      k = test_kappa),
               ncol = 4, byrow = TRUE)
test[1, 1] should be smallest (on average), and test[3, 4] should be largest. (I know that due to random variability this won't always be the case, but I've tried it with many replications.)
So, the mapping and matching between matrices isn't working as I had anticipated.
Any guidance welcomed.
You cannot compute the mean of circular observations by simply calling mean; that is wrong. The correct way is to compute the mean of the cosine and the sine of the angles and then take the arc tangent. See packages for directional or circular data for this.
Secondly, you gave us an idea: to return a matrix of von Mises generated data. But since brms does this job for you, at the moment I would go there.
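As for the elementwise mapping itself, a minimal sketch (using the radians and kappa matrices from the question; this assumes Rfast is installed): since a matrix in R is stored as a plain vector, mapply can walk both matrices element by element, and the output can be reshaped back into a matrix.
# mapply pairs the two matrices element by element (column-major order);
# reshaping with nrow/ncol restores the original positions
result <- matrix(mapply(function(m, k) Rfast::rvonmises(1, m, k),
                        radians, kappa),
                 nrow = nrow(radians), ncol = ncol(radians))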

Hierarchical clustering for centers of kmeans in R

I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column a variable. For these data I would like to do hierarchical clustering. Unfortunately, as the number of rows is huge, this is impossible on my computer: I would need to compute the distance matrix for all pairs of observations, a (200,000 * 200,000) matrix.
The answer to this question suggests first using kmeans to calculate a number of centers, then performing the hierarchical clustering on the coordinates of these centers using the library FactoMineR.
The problem: I keep getting an error when applying the same method!
#example
# Data
library(FactoMineR)
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")
kClust_MyData <- kmeans(MyData, 1000, iter.max = 20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(Hclust_MyData, choice = "tree")
But I get:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) :
object 'data.clust' not found
The package fastcluster has a method hclust.vector that does not require a distance matrix as input, but computes the distances itself in a more memory efficient way. From the fastcluster manual:
The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast
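Applied to the kmeans-then-hierarchical idea above, a sketch (assuming fastcluster is installed) might be:
library(fastcluster)
# cluster the 1,000 kmeans centers instead of the raw observations
hc <- hclust.vector(kClust_MyData$centers, method = "ward")
plot(hc)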

knapsack case r implementation for multiple persons using genetic algorithm

I am trying to implement a genetic algorithm in R. I found out that R has the 'GA' and 'genalg' packages for genetic algorithm implementations. I encountered the example in the link http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/. It solves the knapsack problem, which can be briefly explained as:
"You are going to spend a month in the wilderness. You're taking a backpack with you; however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of 'survival points'. Your objective is to maximize the number of survival points."
The problem is easily solved using the 'genalg' package for a single person, and the output is a binary string. Now I have a doubt: let's say that instead of one person there are 2 or more, i.e. multiple persons, and we need to distribute the survival points. The weight constraint applies to each person. How can we solve this problem? Can we use the 'genalg' or 'GA' package? If so, how do we apply them? Are there any examples of this solved in R or other software?
Thanks
The R package adagio (https://cran.r-project.org/web/packages/adagio/index.html) comes with two functions (knapsack and mknapsack) which solve this type of problem more efficiently by dynamic programming.
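For a single person, a minimal sketch with adagio (assuming its knapsack(w, p, cap) interface and the item data from the example below):
library(adagio)
# integer weights, survival points, and the 20 kg capacity from the example
w <- c(1, 5, 10, 1, 7, 5, 1)
p <- c(10, 20, 15, 2, 30, 10, 30)
sol <- knapsack(w, p, cap = 20)
sol$indices  # items selected in the optimal pack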
A simple approach could be to have one chromosome containing all individuals in the group and have the evaluation function split this chromosome into multiple parts, one for each individual, and then evaluate these parts. In the example below (based on the example in the question) I have assumed that each individual has the same weight limit and that multiple individuals can bring the same item.
library(genalg)
#Set up the problem parameters
#how many people in the group
individual_count <- 3
#The weight limit for one individual
weightlimit <- 20
#The items with their survivalpoints
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions",
                               "sleeping bag", "rope", "compass"),
                      survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
                      weight = c(1, 5, 10, 1, 7, 5, 1))
#Next, we choose the number of iterations, design and run the model.
iter <- 100
#Our chromosome has to be large enough to contain a bit for all individuals and for all items in the dataset
chromosomesize <- individual_count * nrow(dataset)
#Function definitions
#A function to split vector X in N equal parts
split_vector <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
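#For example, split_vector(1:6, 3) returns list(1:2, 3:4, 5:6), so each
#individual's bits stay contiguous within the chromosome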
#Evaluate an individual (a part of the chromosome)
evalIndividual <- function(x) {
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
#Evaluate a chromosome
evalFunc <- function(x) {
  #First split the chromosome into a list of individuals, then evaluate them all
  individuals <- split_vector(x, individual_count)
  #sapply the evalIndividual function to each element of individuals
  return(sum(sapply(individuals, evalIndividual)))
}
#Run the Genetic Algorithm
GAmodel <- rbga.bin(size = chromosomesize, popSize = 200, iters = iter,
                    mutationChance = 0.01, elitism = TRUE, evalFunc = evalFunc)
#First show a summary
summary(GAmodel, echo = TRUE)
#Then extract the best solution from the GAmodel (copy/paste from the source
#code of the summary function)
filter <- GAmodel$evaluations == min(GAmodel$evaluations)
bestSolution <- GAmodel$population[filter, , drop = FALSE][1, ]
#Now split the solution into the individuals
split_vector(bestSolution, individual_count)
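To check the result, you could score each individual's part of the best chromosome, for example:
# survival points and weight carried by each individual in the best solution
sapply(split_vector(bestSolution, individual_count),
       function(ind) c(points = ind %*% dataset$survivalpoints,
                       weight = ind %*% dataset$weight))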
