How to decide the best number of clusters for kamila clustering with R?

I have a mixed-type data set, so I wanted to try kamila clustering. It is easy to apply, but I would like a plot, similar to an elbow (knee) plot, to help decide the number of clusters.
library(kamila)

data <- read.csv("binarymat.csv", header = FALSE, sep = ";")
conInd <- 9
conVars <- data[, conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[, 1:8]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)  # dummy coding (not used below)
kamRes <- kamila(conVars, catVarsFac, numClust = 5, numInit = 10,
                 calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?

From the kamila package documentation:
Setting calcNumClust to 'ps' uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC-based methods for large
sample sizes.
In your case, you have specified only one value for numClust, so it doesn't look like you are actually selecting the number of clusters; you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example numClust = 2:7, as well as the method for selecting the number of clusters.
If you also want to select the number of clusters, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2:7, numInit = 10,
                 calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.
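For a knee-style plot, you can draw the prediction strength values against the candidate numbers of clusters and mark the threshold used in the call (a minimal sketch, assuming kamRes comes from the numClust = 2:7 call above):
plot(2:7, kamRes$nClust$psValues, type = "b",
     xlab = "Number of clusters", ylab = "Prediction strength")
abline(h = 0.5, lty = 2)  # the predStrThresh used in the call above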

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 obs of 196 var) whose elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were correctly classified, whereas with k-means the result reaches more than 90%).
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")[, 1]  # labels as a vector
nclusters <- 5
gmm <- Mclust(X, G = nclusters)  # I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters) {
  ii <- which(cl_gmm == k)   # individuals of group k
  counts <- table(y[ii])     # number of occurrences for each label
  imax <- which.max(counts)  # majority label
  maj_lab <- names(counts)[imax]
  print(paste("Group", k, ", majority label =", maj_lab))
  cl_gmm_lab[ii] <- maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab)  # confusion matrix
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is NA when looking at gmm$BIC.
So far I have not found any solution to this problem. Are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every covariance model when too many explanatory variables are involved. The right thing to do is to first reduce the dimensionality, selecting an optimal number of dimensions that makes it possible to properly fit GMMs while preserving as much information as possible about the data.
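A minimal sketch of that idea, using PCA to compress the 196 variables before fitting the mixture (the choice of 10 components here is an assumption for illustration; pick it from the explained variance):
library(mclust)
pca <- prcomp(X, scale. = TRUE)  # principal components of the predictors
X_red <- pca$x[, 1:10]           # keep the first 10 components (assumed; tune this)
gmm <- Mclust(X_red, G = 5)      # more covariance models should now be estimable
summary(gmm)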

Simulation with N trials in R

I am trying to create a simulation where a number from 0:100 is chosen by a person, then a random number from 0:100 is generated using sample(). The difference between the chosen number and the random number is calculated and stored. I would like to use a for loop to run this 10000 times and store the results in a vector so I can later plot them. Can anyone point me to where I can read about this or see some examples? Below is what I have so far, but I keep getting warnings that the longer object length is not a multiple of the shorter object length.
N = 10000
chosen.number = 0:100
generated.number = sample(0:100, N, replace = T)
differences = numeric(0)
for (i in 1:length(chosen.number)) {
  differences = (generated.number - chosen.number)
}
Then I'll make a scatterplot of the vector differences.
Here's an example of how you could go about it (if I understand your question correctly).
You can set how many loops you want using Repeat.
Since you want a different randomly generated number each time, you'll have to put sample() inside your loop. I didn't know where your user-selected number would come from, so in this example it is also randomly generated, with the same criteria as the random selection.
Then differences are collected in collect_differences for you to use downstream.
Repeat <- 10  # number of times to repeat/loop
collect_differences <- NULL
for (i in 1:Repeat) {
  randomly.generated.number <- sample(0:100, size = 1, replace = T)
  selected.number <- sample(0:100, size = 1, replace = T)
  differences <- randomly.generated.number - selected.number
  collect_differences <- c(collect_differences, differences)
}
collect_differences
As for resources, you can look up anything related to the fundamentals of looping. You could also look through The Carpentries lessons in R as they have some resources for this as well.
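As an aside, the same simulation can be written without an explicit loop, since sample() and subtraction are vectorised in R (a sketch, assuming the 10000 trials from the question):
N <- 10000
generated.number <- sample(0:100, N, replace = TRUE)
chosen.number <- sample(0:100, N, replace = TRUE)  # stand-in for the person's choices
differences <- generated.number - chosen.number
plot(differences)  # scatterplot of the stored differences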

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[, -1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[, 10]  # prediction based on 10 components
confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = TRUE,
                                            name.vector = TRUE, factors.as.char = TRUE)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, while results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
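To keep track of which element belongs to which number of components, you could also name the list entries (a small sketch building on the confmat() function above):
ncomps <- 10:2
results <- lapply(ncomps, confmat)
names(results) <- paste0("ncomp_", ncomps)
results[["ncomp_10"]]  # BER for the 10-component prediction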

MPC K-means constraints definition using conclust package in R

I'm using the conclust package in R to perform semi-supervised clustering with the MPC k-means algorithm, clustering fuel stations based on their activities.
I started with the code below.
mustLink <- list(c('station x', 'station y'))
cantLink <- list(c('station z', 'station w'))
mpckm(subset, 5, mustLink, cantLink, maxIter = 10)
subset is my dataframe.
station x, station y, station z and station w represent the row indices.
My problem is I'm not sure how to define my constraints.
For now I'm beginning with simple constraints, for example: I don't want station x to be in the same cluster as station y.
In the conclust package, the mpckm function takes two lists of must-link and cannot-link constraints but no further details are added.
I tried doing the same thing with the row indices of the stations in the constraint lists, but this didn't work, throwing this error:
Error in 1:nm : argument of length 0
What exactly am I missing?
It works for me with objects of class matrix: besides the lists shown in the help examples, matrices of constraint pairs are also accepted.
So you should do something similar to the following (assuming subset is a data.frame):
mustLink <- cbind(which(rownames(subset) == 'station x'),
                  which(rownames(subset) == 'station y'))
cantLink <- cbind(which(rownames(subset) == 'station z'),
                  which(rownames(subset) == 'station w'))
mpckm(subset, 5, mustLink, cantLink, maxIter = 10)
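If you already know the numeric row indices, a minimal sketch of the same idea (the indices 1, 2, 3 and 4 here are assumptions for illustration; each constraint pair is one row of the matrix):
library(conclust)
mustLink <- matrix(c(1, 2), ncol = 2)  # row 1 must cluster with row 2
cantLink <- matrix(c(3, 4), ncol = 2)  # row 3 must not cluster with row 4
fit <- mpckm(as.matrix(subset), 5, mustLink, cantLink, maxIter = 10)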

How to create a random loss sample in R using an if function

I am currently working on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine whether an observation had a loss (yes = 1) or not (= 0).
Afterwards I am trying to generate the loss amount, using a random distribution, for all observations which had a loss (= 1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, so I am thinking of using a beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000, 1, 0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
Ideally I can combine the two into something like:
if (Loss_Y_N == 1) then Loss_Amount = dbeta(...)  # ... is meant to be a random variable with mean = 0.15 and 0 < x <= 1
else Loss_Amount = 0
Any input highly appreciated!
Create a vector for your loss proportions. Fill in the elements corresponding to losses with draws from the beta distribution. Tweak the beta parameters until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
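Note that Beta(10, 990) has mean 10 / (10 + 990) = 0.01, while the question asks for a mean of 0.15. Since a Beta(a, b) has mean a / (a + b), one assumed choice hitting 0.15 is a = 3, b = 17 (larger a and b with the same ratio give a tighter spread):
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 3, 17)  # mean 3 / (3 + 17) = 0.15
hist(loss_prop[loss_indicator > 0])  # inspect the simulated loss proportions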
