Randomizations and hierarchical tree - r

I am trying to permute (column-wise only) my data matrix a 1000 times and then do hierarchical clustering in "R" so I have the final tree on my data after 1000 randomizations.
This is where I am lost. I have this loop
for(i in 1:1000)
{
permuted <- test2_matrix[,sample(ncol(test2_matrix), 12, replace=TRUE)]; (this permutes my columns)
d = dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2);
clust = hclust(d, method = "complete", members=NULL);
}
png (filename="cluster_dendrogram_bootstrap.png", width=1024, height=1024, pointsize=10)
plot(clust)
I am not sure if the final tree is a product after the 1000 randomizations or just the last tree that it calculated in the loop. Also If I want to display the bootstrap values on the tree how should I go about it?
Many thanks!!

The value of clust in your example is indeed the final tree calculated in the loop. Here's a way of making and saving 1000 permutations of your matrix
make.permuted.clust <- function(i){ # this argument is not used
permuted <- data.matrix[,sample(ncol(data.matrix), 12, replace=TRUE)]
d <- dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
clust <- hclust(d, method = "complete", members=NULL)
clust # return value
}
all.clust <- lapply(1:1000, make.permuted.clust) # 1000 hclust trees
The second part of your question should be answered here.

You may be interested in the RandomForest method implemented in the randomForest package, which implements both bootstrapping of the data and of the splitting variables and allows you to save trees and get a consensus tree.
library(randomForest)
The original random forest (in FORTRAN 77) developers site
The package manual

Related

Simulate a list of kppm objects in R spatstat

I would like to use envelope simulation on a list of kppm objects.
I have a point dataset that is in a grid pattern. I would like to shift each point such that its position in the cell is random, then run a spatstat simulation to determine if the points are randomly distributed (obviously will be at sub cell size scale). Then randonly shift shift each point inside the cell, and re-run the simulation - repeating n times and generate an envelope of all the simulations. There are 2 ways to do this 1) using a list of point sets and 2) using simulate.
Method 1) - on a list called my_kppm_list
Lmat = envelope(fitM, Lest, nsim = 19, global=FALSE, simulate=my_kppm_list)
plot(Lmat)
How to create a list of cluster process models (kppm)?
Naive way:
my_list <- list(kppm_0, kppm_1)
This fails when trying to run simulation:
Lmat = envelope(fitM, Lest, nsim = 19, global=FALSE, simulate=my_list)
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim = nsim, :
‘simulate’ should be an expression, or a list of point patterns of the same kind as X
I can convert a list to .ppm
fitM_0 <- kppm(create_pts(), ~1, "MatClust")
fitM_1 <- kppm(create_pts(), ~1, "MatClust")
my_list <- list(fitM_0, fitM_1)
ppm_list <- lapply(my_list, as.ppm)
But trying to convert to kppm fails with an error
kppm_list <- lapply(my_list, as.kppm)
Method 2) apply a function in simulation such that a random shift is applied to each point then simulation run, and envelope of all simulations is used (see page 399 of Baddelley et al. Spatial Point Patterns book (2016)):
e_rand <- function(){
j_x <- matrix(unlist(runif(dim(c_df)[1], min=-10, max=10)), ncol = 1, byrow = T)
j_y <- matrix(unlist(runif(dim(c_df)[1], min=-10, max=10)), ncol = 1, byrow = T)
x_j<- c_df[,1]+j_x
y_j<- c_df[,2]+j_y
c_j <- ppp(x = x_j, y = y_j, window = window)
return(c_j)
}
Lmat = envelope(fitM, Lest, nsim = 19, global=TRUE, simulate=e_rand)
However I found that the null model (red dashed line in output plot) had kinks in it when simulate is added - kinks that do not appear without simulate.
How to create a list of cluster process models (kppm)?
This is not the problem. In your example you are successfully creating a list of objects of class kppm. If fit1 and fit2 are fitted models of class kppm, then
m <- list(fit1, fit2)
is a list of objects of class kppm.
The problem is that you then pass this list as an argument to the function envelope which does not accept this format. The error message from envelope.ppp or envelope.kppm says that the argument simulate must be either a fitted model, or a list of point patterns.
An envelope is constructed by generating simulated point patterns, computing a summary function for each simulated pattern, and computing the upper and lower extremes of these summary functions. The argument simulate (if it is given) is a recipe for generating the simulated point patterns. If simulate is an expression like expression(runifpoint(42)) then this expression will be evaluated nsim times to produce nsim different point patterns. If simulate is a fitted model, then nsim simulated realisations of the model will be generated. If simulate is a list of point patterns, then they will be taken as the simulated random patterns.
It is unclear what you want to do with your list of models.
Do you want to construct a single envelope, or a list of envelopes?
Option 1: you have a list m of models of class kppm. For each of these models m[[i]], you want to construct an envelope e[[i]], where the limits are determinedby simulation from m[[i]]`.
Option 2: you have a list m of models of class kppm. For each model m[[i]] you want to generate one point pattern, say X[[i]], and build an envelope e using these patterns.
For option 1, type something like
e <- anylapply(m, function(fit) {
envelope(Y, Lest, nsim = 19, global=FALSE, simulate=fit)})
For option 2,
X <- solapply(m, simulate, nsim=1, drop=TRUE)
e <- envelope(Y, Lest, nsim=19, global=FALSE, simulate=X)
If you wanted something else, please clarify.

How to get gap statistic for hierarchical average clustering

I perform a hierarchical cluster analysis based on 'average linkage' In base r, I use
dist_mat <- dist(cdata, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
I want to calculate the gap statistics to decide optimal number of clusters. I use the 'cluster' library and the clusGap function. Since I can't pass the hclust solution nor specify average hiearchical clustering in the clusGap function, I use these lines:
cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x, method = "euclidean"), method="average"), k = k))
gap_stat <- clusGap(cdata, FUN=cluster_fun, K.max=10, B=50)
print(gap_stat)
However, here I can't check the cluster solution. So, my question is - can I be sure that the gap statistic is calculated on the same solution as hclust_avg?
Is there a better way of doing this?
Yes it should be the same. In the clusGap function, it calls the cluster_fun for each k you provided, then calculates the pooled within cluster sum of squares around, as described in the paper
This is the bit of code called inside clusGap that calls your custom function:
W.k <- function(X, kk) {
clus <- if (kk > 1)
FUNcluster(X, kk, ...)$cluster
else rep.int(1L, nrow(X))
0.5 * sum(vapply(split(ii, clus), function(I) {
xs <- X[I, , drop = FALSE]
sum(dist(xs)^d.power/nrow(xs))
}, 0))
}
And from here, the gap statistics is calculated.
You can calculate the gap statistic using some custom code, but for the sake of reproducibility, etc, it might be easier to use this?
Thanhs for solving it. I must say this is good enough solution but you can try below given code as well.
# Gap Statistic for K means
def optimalK(data, nrefs=3, maxClusters=15):
"""
Calculates KMeans optimal K using Gap Statistic
Params:
data: ndarry of shape (n_samples, n_features)
nrefs: number of sample reference datasets to create
maxClusters: Maximum number of clusters to test for
Returns: (gaps, optimalK)
"""
gaps = np.zeros((len(range(1, maxClusters)),))
resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]})
for gap_index, k in enumerate(range(1, maxClusters)):
# Holder for reference dispersion results
refDisps = np.zeros(nrefs)
# For n references, generate random sample and perform kmeans getting resulting dispersion of each loop
for i in range(nrefs):
# Create new random reference set
randomReference = np.random.random_sample(size=data.shape)
# Fit to it
km = KMeans(k)
km.fit(randomReference)
refDisp = km.inertia_
refDisps[i] = refDisp
# Fit cluster to original data and create dispersion
km = KMeans(k)
km.fit(data)
origDisp = km.inertia_
# Calculate gap statistic
gap = np.log(np.mean(refDisps)) - np.log(origDisp)
# Assign this loop's gap statistic to gaps
gaps[gap_index] = gap
resultsdf = resultsdf.append({'clusterCount':k, 'gap':gap}, ignore_index=True)
return (gaps.argmax() + 1, resultsdf)
score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('Gap Statistic');
plt.title('Gap Statistic vs. K');

How to plot per tree ROC curves from randomForest in R?

I know that randomForest is supposed to be a black box, and that most people are interested in the ROC curve of the classifier as a whole, but I'm working on a problem in which I need to inspect individual trees of RF. I'm not very experienced with R so what's an easy way to plot ROC curves for the individual trees generated by RF?
I don't think you can generate a ROC curve from a single tree from a random forest generated by the randomForest package. You can access the output of each tree from a prediction, for example over the training set.
# caret for an example data set
library(caret)
library(randomForest)
data(GermanCredit)
# use only 50 rows for demonstration
nrows = 50
# extract the first 9 columns and 50 rows as training data (column 10 is "Class", the target)
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# build the model
rf_model = randomForest(x = x, y = y, ntree = 11)
# Compute the prediction over the training data. Note predict.all = TRUE
rf_pred = predict(rf_model, newdata = x, predict.all = TRUE, type = "prob")
You can access the predictions of each tree with
rf_pred$individual
However, the prediction of a single tree is only the most likely label. For a ROC curve you need class probabilities, so that changing the decision threshold changes the predicted class to vary true and false positive rates.
As far as I can tell, at least in package randomForest there is no way to make the leaves output probabilities instead of labels. If you inspect a tree with getTree(), you will see that the prediction is binary; use getTree(rf_model, k = 1, labelVar = TRUE) and you'll see the labels in plain text.
What you can do, though, is to retrieve individual predictions via predict.all = TRUE and then manually compute class labels on subsets of the whole forest. This you can then input into a function to compute ROC curves like those from the ROCR package.
Edit: Ok, from the link you provided in your comment I got the idea how a ROC curve can be obtained. First, we need to extract one particular tree and then input each data point into the tree, in order to count the occurances of the success class at each node as well as total data points in each node. The ratio gives the node probability for success class. Next, we do something similar, i.e. input each data point into the tree, but now record the probability. This way we can compare the class probs with the true label.
Here is the code:
# libraries we need
library(randomForest)
library(ROCR)
# Set fixed seed for reproducibility
set.seed(54321)
# Define function to read out output node of a tree for a given data point
travelTree = function(tree, data_row) {
node = 1
while (tree[node, "status"] != -1) {
split_value = data_row[, tree[node, "split var"]]
if (tree[node, "split point"] > split_value ) {
node = tree[node, "right daughter"]
} else {
node = tree[node, "left daughter"]
}
}
return(node)
}
# define number of data rows
nrows = 100
ntree = 11
# load example data
data(GermanCredit)
# Easier access of variables
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# Build RF model
rf_model = randomForest(x = x, y = y, ntree = ntree, nodesize = 10)
# Extract single tree and add variables we need to compute class probs
single_tree = getTree(rf_model, k = 2, labelVar = TRUE)
single_tree$"split var" = as.character(single_tree$"split var")
single_tree$sum_good = 0
single_tree$sum = 0
single_tree$pred_prob = 0
for (zeile in 1:nrow(x)) {
out_node = travelTree(single_tree, x[zeile, ])
single_tree$sum_good[out_node] = single_tree$sum_good[out_node] + (y[zeile] == "Good")
single_tree$sum[out_node] = single_tree$sum[out_node] + 1
}
# Compute class probabilities from count of "Good" data points in each node.
# Make sure we do not divide by zero
idcs = single_tree$sum != 0
single_tree$pred_prob[idcs] = single_tree$sum_good[idcs] / single_tree$sum[idcs]
# Compute prediction by inserting again data set into tree, but read out
# previously computed probs
single_tree_pred = rep(0, nrow(x))
for (zeile in 1:nrow(x)) {
out_node = travelTree(single_tree, x[zeile, ])
single_tree_pred[zeile] = single_tree$pred_prob[out_node]
}
# Et voila: The ROC curve for single tree!
plot(performance(prediction(single_tree_pred, y), "tpr", "fpr"))

Hierarchical Cluster using dissimilarity matrix R

I have mixed data type matrix Data_string size (947 x 41) that contain numeric and categorical attributes.
I produced a distance matrix (947 x 947) using the daisy() function and Gower distance measure in Rstudio.
d <- daisy(Data_String, metric = "gower", stand = FALSE,type = list(symm = c("V1","V13") , asymm = c("V8","V9","V10")))
I applied hierarchical Cluster using dissimilarity matrix (d).
# hclust
hc <- hclust(d, method="complete")
plot(hc)
rect.hclust(hc, 4)
cut <- cutree(hc, k = 1:5)
View(cut)
#Diana
d_as <- as.matrix(d)
DianaCluster <- diana(d_as, diss = TRUE, keep.diss = TRUE)
print(DianaCluster)
plot(DianaCluster)
The following is the plots I had.
** Note: I couldn't upload the image here since I do not have enough reputation's points.
I am struggling to understand the results, can anyone please
1- suggest any solution that I can apply in R to simplify the understanding of my results.
or
2- how I can link it to my source data, since all the results are based on the dissimilarity matrix.
Please take a look at -
https://stats.stackexchange.com/questions/130974/how-to-use-both-binary-and-continuous-variables-together-in-clustering
It explains how to use gower dissimilarity matrix with hclust. Hope this helps!

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60.
I have divided my corpus into ten batches and set aside one batch for a holdout set. I have ran latent dirichlet allocation (LDA) using nine batches (total 180 documents) with topics 10 to 60. Now, I have to calculate perplexity or log likelihood for the holdout set.
I found this code from one of CV's discussion sessions. I really don't understand several lines of code below. I have dtm matrix using the holdout set (20 documents). But I don't know how to calculate the perplexity or log likelihood of this holdout set.
Questions:
Can anybody explain to me what seq(2, 100, by =1) mean here? Also, what AssociatedPress[21:30] mean? What function(k) is doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate perplexity or log likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions but since I'm new I can not figure out how to implement it with my holdout matrix, called dtm.
How can I do ten fold cross validation with my corpus, containing 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but again cannot figure that out either.
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to fit with the perplexity function, which takes as arguments a previously fit topic model and a new set of data, and returns a single number. The lower the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelize by the different number of topics. A processor is allocated a value
# of k, and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise
system.time({
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar%{
k <- candidate_k[j]
results_1k <- matrix(0, nrow = folds, ncol = 2)
colnames(results_1k) <- c("k", "perplexity")
for(i in 1:folds){
train_set <- full_data[splitfolds != i , ]
valid_set <- full_data[splitfolds == i, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
}
return(results_1k)
}
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
"(ie five different models fit for each candidate number of topics)") +
labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
I wrote the answer on CV that you refer to, here's a bit more detail:
seq(2, 100, by =1) simply creates a number sequence from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models. One model with 2 topics, another with 3 topics, another with 4 topics and so on to 100 topics.
AssociatedPress[21:30] is simply a subset of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.
Regarding the general question of optimal topic numbers, I now follow the example of Martin
Ponweiser on Model Selection by Harmonic Mean (4.3.3 in his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision=2000L) {
library("Rmpfr")
llMed <- median(logLikelihoods)
as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
prec = precision) + llMed))))
}
# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where keep indicates that every keep iteration the log-likelihood is evaluated and stored. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:
logLiks <- fitted#logLiks[-c(1:(burnin/keep))]
# assuming that burnin is a multiple of keep and
harmonicMean(logLiks)
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 1 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract logliks from each topic
logLiks_many <- lapply(fitted_many, function(L) L#logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package, see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf Give that a try and then come back with a more specific question about coding CV with topic models.

Resources