Context: I am trying to 1) define clusters based on certain individuals and 2) assign other individuals to the defined clusters.
What has been done: I use FactoMineR functions PCA() and HCPC() according to the workflow described by Husson, F., Josse, J., Pages, J., 2010. Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data? Technical Report – Agrocampus 17.
Question: is it possible to assign each supplementary individual (PCA(..., ind.sup = ***)) to a cluster defined in the hierarchical clustering analysis?
Similar question: this question has already been asked on Stack Overflow here, but that was 5 years ago and the answer does not fit into the HCPC workflow.
Please find below some code using the base R dataset mtcars:
if(!require(FactoMineR)){install.packages("FactoMineR")}
library(FactoMineR)
if(!require(factoextra)){install.packages("factoextra")}
library(factoextra)
# let's use the mtcars dataset for this question
head(mtcars)
# some individuals are considered as supplementary:
mtcars[22:nrow(mtcars),]
# HCPC workflow
res.pca = PCA(mtcars,
ind.sup = c(22:nrow(mtcars)), # the last 11 individuals (rows 22 to 32) are supplementary
scale.unit = TRUE,
ncp = 5,
graph = TRUE)
fviz_pca_ind(res.pca) # here supplementary individuals are included
res.hcpc = HCPC(res.pca,
nb.clust = -1, # automatic tree cut
min = 3,
max = NULL,
graph = FALSE,
kk=Inf) # no k-means pre-processing
# here are the two results needed to illustrate my question
fviz_dend(res.hcpc, show_labels = TRUE)
res.hcpc$desc.ind # notice that individuals considered as supplementary are not included in any cluster
Expected output: something like the dendrogram above, but with the supplementary individuals assigned to clusters and clearly identified, for example with Fiat X1-9 and Lotus Europa (supplementary individuals) assigned to the cluster containing Fiat 128.
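One way this might be approached, offered only as a sketch: project the supplementary individuals with PCA() as above, compute each cluster's centroid in the PCA coordinate space, and give each supplementary individual the label of the nearest centroid. The nearest-centroid rule is an assumption of this sketch, not something HCPC() does itself; the names `act`, `sup`, `centroids` and `sup_clust` are illustrative.

```r
library(FactoMineR)

sup_rows <- 22:nrow(mtcars)
res.pca  <- PCA(mtcars, ind.sup = sup_rows, scale.unit = TRUE,
                ncp = 5, graph = FALSE)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)

# Coordinates of active and supplementary individuals on the retained axes
act <- res.pca$ind$coord
sup <- res.pca$ind.sup$coord

# Cluster label of each active individual, matched by row name
clust <- res.hcpc$data.clust[rownames(act), "clust"]

# Cluster centroids in PCA space (one row per cluster, one column per axis)
centroids <- apply(act, 2, function(col) tapply(col, clust, mean))

# Squared Euclidean distance from every supplementary individual to every centroid
d2 <- sapply(rownames(centroids), function(k) {
  rowSums(sweep(sup, 2, centroids[k, ])^2)
})

# Assumption: assign each supplementary individual to its nearest centroid
sup_clust <- colnames(d2)[max.col(-d2, ties.method = "first")]
names(sup_clust) <- rownames(sup)
sup_clust
```

The resulting labels could then be combined with `res.hcpc$data.clust$clust` to colour all individuals in a single plot.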
Related
I would like to simulate exponential family random graphs, and I just started learning to use the statnet and ergm R packages. From the tutorial I found online, I am able to learn an ERGM model from an example dataset:
# install.packages('statnet')
# install.packages('ergm')
# install.packages('coda')
library(statnet)
set.seed(123)
data(package='ergm') # tells us the datasets in our packages
data(florentine) # loads flomarriage and flobusiness data
# Triad model
flomodel <- ergm(flomarriage ~ edges + triangle)
summary(flomodel)
Currently, I would like to use the simulate command to simulate networks with a pre-specified number of nodes from a pre-specified formula (that is not learned from any particular dataset), for example, P(y) = 1/Z exp(a * num_edges + b * num_triangles), where a and b are user-specified coefficients.
How should I go about writing such a model in statnet?
You can simulate from a given formula with simulate (or simulate.formula):
simulate(flomarriage ~ edges + triangles, coef = c(3,1))
To constrain the simulation to have the same number of edges as the given graph (flomarriage in this case):
simulate(flomarriage ~ edges + triangles, coef = c(3,1), constraints = ~edges)
Not every constraint you might want to apply is available, since each requires a specific MCMC sampler, but for a list of what is available see ?ergm.constraints
To fix the simulation to have an arbitrary number of nodes and edges (not based on observed data), a workaround is to create such a network first. For example, to simulate over networks with 17 nodes and 16 edges:
test.mat = matrix(0, 17, 17)
test.mat[1,] = 1 #adds 16 edges
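test.net = as.network(test.mat, directed = FALSE)
test.sim = simulate(test.net ~ triangles, coef = 1, constraints = ~edges)
# summary() on a formula reports the statistics; the older summary.statistics() may be unavailable in recent ergm releases
summary(test.sim ~ edges + triangles)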
P.S. I don't recommend using the triangles term in ERGMs. The geometrically weighted terms (gwesp, gwdsp) are better substitutes and are more stable.
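If only the number of nodes should be fixed (not the number of edges), a possible sketch is to start from an empty network and pass the user-specified coefficients directly. The coefficient values `a` and `b` below are arbitrary illustrations, and `network.initialize()` comes from the network package that ergm loads.

```r
library(ergm)
set.seed(123)

# Empty undirected network with a pre-specified number of nodes
n.nodes   <- 17
empty.net <- network.initialize(n.nodes, directed = FALSE)

# User-specified coefficients: a for edges, b for triangles (illustrative values)
a <- -2
b <- 0.1

# One draw from P(y) proportional to exp(a * edges + b * triangles)
sim.net <- simulate(empty.net ~ edges + triangle, coef = c(a, b))
summary(sim.net ~ edges + triangle)

# Following the P.S. above, the same idea with a geometrically weighted term
sim.gw <- simulate(empty.net ~ edges + gwesp(0.5, fixed = TRUE), coef = c(a, b))
summary(sim.gw ~ edges + triangle)
```

A strongly positive triangle coefficient tends to push simulations toward nearly complete graphs, which is one reason to try the gwesp variant.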
In the R version of H2O, is it possible to specify a blocking factor when splitting data in training/validation/test sets and/or when doing cross-validation?
I'm working on a clinical dataset with multiple observations from the same patient that should be kept together during these operations.
If this is not possible to do within the H2O framework then suggestions on how to achieve this in R and integrate with H2O functions would be great.
Thanks!
When using H2O-3 with cross validation, you can tell the training algorithm which fold number an observation belongs to with the fold_column parameter. See:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_column.html
The code example below (copied from the link above) shows folds being assigned randomly, but you could instead write code to assign them yourself, for example so that all rows from one patient share the same fold.
library(h2o)
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"
# create a fold column with 5 folds
# randomly assign fold numbers 0 through 4 for each row in the column
fold_numbers <- h2o.kfold_column(cars, nfolds=5)
# rename the column "fold_numbers"
names(fold_numbers) <- "fold_numbers"
# print the fold_assignment column
print(fold_numbers)
# append the fold_numbers column to the cars dataset
cars <- h2o.cbind(cars,fold_numbers)
# try using the fold_column parameter:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = cars,
fold_column="fold_numbers", seed = 1234)
# print the auc for your model
print(h2o.auc(cars_gbm, xval = TRUE))
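For the blocking case in the question, the fold numbers could be drawn once per patient rather than once per row, so that all observations from a patient land in the same fold. This is a minimal sketch in base R; the data frame `clinical` and its columns `patient_id`, `x` and `outcome` are hypothetical, and the H2O hand-off is left as comments because it needs a running H2O cluster.

```r
set.seed(42)

# Hypothetical clinical data: 20 patients with 2 to 5 observations each
n_obs    <- sample(2:5, 20, replace = TRUE)
clinical <- data.frame(patient_id = rep(sprintf("P%02d", 1:20), times = n_obs))
clinical$x       <- rnorm(nrow(clinical))
clinical$outcome <- factor(rbinom(nrow(clinical), 1, 0.5))

# Draw one fold number (0 to nfolds - 1) per patient, not per row
nfolds       <- 5
patients     <- unique(as.character(clinical$patient_id))
patient_fold <- setNames(sample(rep_len(0:(nfolds - 1), length(patients))),
                         patients)

# Every row inherits the fold of its patient
clinical$fold <- unname(patient_fold[as.character(clinical$patient_id)])

# Check: no patient is split across folds
folds_per_patient <- tapply(clinical$fold, clinical$patient_id,
                            function(f) length(unique(f)))
stopifnot(all(folds_per_patient == 1))

# Hand-off to H2O (requires library(h2o) and h2o.init()):
# clinical_hf <- as.h2o(clinical)
# model <- h2o.gbm(x = "x", y = "outcome", training_frame = clinical_hf,
#                  fold_column = "fold")
```

The same per-patient draw could be used to build a grouped training/validation/test split by mapping fold numbers to the three sets.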
I am trying to implement hierarchical clustering in R with hclust(). This requires a distance matrix created by dist(), but my dataset has around a million rows, and even EC2 instances run out of RAM. Is there a workaround?
One possible solution for this is to sample your data, cluster the smaller sample, then treat the clustered sample as training data for k Nearest Neighbors and "classify" the rest of the data. Here is a quick example with 1.1M rows. I use a sample of 5000 points. The original data is not well-separated, but with only 1/220 of the data, the sample is separated. Since your question referred to hclust, I used that. But you could use other clustering algorithms like dbscan or mean shift.
## Generate data
set.seed(2017)
x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000, 5,1.1))
XY = data.frame(x,y)
Sample5K = sample(length(x), 5000) ## Downsample
## Cluster the sample
DM5K = dist(XY[Sample5K,])
HC5K = hclust(DM5K, method="single")
Groups = cutree(HC5K, 8)
Groups[Groups>4] = 4
plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])
Now just assign all other points to the nearest cluster.
Core = which(Groups<4)
library(class)
knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])
A few quick notes.
Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters.
Sampling 1/220 could completely miss any small clusters. In the small sample, they would just look like noise.
Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I fitted the models with poLCA, but I could not find any function to perform the test.
Is there someone that can help me?
Thank you very much!
Here is an example of an (ad-hoc adjusted) LMR test comparing an LCA with 3 groups (alternative model) against one with 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Wow, really late to the game, but as I'm looking at similar things I'll leave this for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-sq test to determine if K classes is a better fit than K-1 classes... basically.
However there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio.
The former is still in common use with Mplus users; the latter is far more common in mixture-model packages in R, e.g. mclust. I don't know about poLCA, though...
I have already trained my clustering model using hclust:
model = hclust(distances, method = "ward.D") # "ward" was renamed "ward.D" in R 3.1.0
And the result looks good:
Now I have some new data records and want to predict which cluster each of them belongs to. How do I do this?
Clustering is not meant to "classify" new data; as the name suggests, that is the core concept of classification.
Some clustering algorithms (such as the centroid-based ones: kmeans, kmedians, etc.) can "label" a new instance based on the model created. Unfortunately, hierarchical clustering is not one of them: it does not partition the input space, it just "connects" some of the objects given during clustering, so you cannot assign a new point to this model.
The only "solution" for using hclust to "classify" is to build another classifier on top of the labeled data produced by hclust. For example, you can train knn (even with k=1) on the data with labels from hclust and use it to assign labels to new points.
As already mentioned, you can use a classifier such as class::knn to determine which cluster a new individual belongs to.
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of a similarity measure. This similarity measure is typically expressed by a distance measure such as the Euclidean distance.
Below is some example code using the iris data.
library(scorecard)
library(factoextra)
library(class)
df_iris <- split_df(iris, ratio = 0.75, seed = 123)
d_iris <- dist(scale(df_iris$train[,-5]))
hc_iris <- hclust(d_iris, method = "ward.D2")
fviz_dend(hc_iris, k = 3,cex = 0.5,k_colors = c("#00AFBB","#E7B800","#FC4E07"),
color_labels_by_k = TRUE, ggtheme = theme_minimal())
groups <- cutree(hc_iris, k = 3)
table(groups)
Predict new data
knnClust <- knn(train = df_iris$train[,-5], test = df_iris$test[,-5] , k = 1, cl = groups)
knnClust
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 2 2 2 2 2 3 3 2 2 3 2 2 2 2 2 2 2 2 2
Levels: 1 2 3
# p1 <- fviz_cluster(list(data = df_iris$train[,-5], cluster = groups), stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("train")
# p2 <- fviz_cluster(list(data = df_iris$test[,-5], cluster = knnClust),stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("test")
# gridExtra::grid.arrange(p1,p2,nrow = 2)
pca1 <- data.frame(prcomp(df_iris$train[,-5], scale. = T)$x[,1:2], cluster = as.factor(groups), factor = "train")
pca2 <- data.frame(prcomp(df_iris$test[,-5], scale. = T)$x[,1:2], cluster = as.factor(knnClust), factor = "test")
pca <- as.data.frame(rbind(pca1,pca2))
Plot train and test data
ggplot(pca, aes(x = PC1, y = PC2, color = cluster, size = 1, alpha = factor)) +
geom_point(shape = 19) + theme_bw()
You can use this classification and then use LDA to predict which class the new point should fall into.
I faced a similar problem and worked out a temporary solution.
In my R environment, the function hclust gives the labels for the training data.
We can fit a supervised learning model to relate these labels to the features.
From then on, we process the data just as we would for any supervised learning model.
If it is a binary classification problem, we can use the KS statistic, the AUC and so on to evaluate the performance of the clustering.
Similarly, we can run PCA on the features and extract PC1 as a label.
Binning this label gives a new label suitable for classification.
Again, we then process the data as we would for a classification model.
In R, I find that PCA runs much faster than hclust (Mayank 2016).
In practice, I find that the model is easy to deploy this way.
However, I am not sure whether this temporary solution biases the predictions.
Ref
Mayank. 2016. “Hclust() in R on Large Datasets.” Stack Overflow.
Why not compute the centroid of the points in each hclust cluster, then assign a new point to the nearest centroid using the same distance function?
knn in the class package only looks at the k nearest neighbours and only supports Euclidean distance.
There's no need to run a classifier.
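A minimal sketch of the centroid idea in base R, using iris as stand-in data. The every-other-row split, the choice of k = 3, and the use of squared Euclidean distance (to match dist()'s default) are all assumptions of this example.

```r
# Deterministic split: odd rows are "training" data, even rows are "new" data
train_idx <- seq(1, nrow(iris), by = 2)
train <- scale(iris[train_idx, 1:4])

# Scale the new data with the training means and standard deviations
new <- scale(iris[-train_idx, 1:4],
             center = attr(train, "scaled:center"),
             scale  = attr(train, "scaled:scale"))

# Hierarchical clustering on the training data, cut into 3 clusters
hc     <- hclust(dist(train), method = "ward.D2")
groups <- cutree(hc, k = 3)

# Centroid of each cluster: column sums per cluster divided by cluster size
centroids <- rowsum(train, groups) / as.vector(table(groups))

# Squared Euclidean distance from every new point to every centroid
d2 <- sapply(seq_len(nrow(centroids)), function(k) {
  rowSums(sweep(new, 2, centroids[k, ])^2)
})

# Assign each new point to the cluster with the nearest centroid
new_groups <- max.col(-d2, ties.method = "first")
table(new_groups)
```

For a different distance function, only the computation of `d2` would need to change.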